AutozLand

Autoz's Learning Blogs



Posted on 2018-08-08 | Edited on 2018-08-07
Practical Machine Learning Course Notes

Xing Su

2018-08-07

  • 0.1 Prediction
    • 0.1.1 In Sample vs Out of Sample Errors
  • 0.2 Prediction Study Design
    • 0.2.1 Sample Division Guidelines for Prediction Study Design
    • 0.2.2 Picking the Right Data
  • 0.3 Types of Errors
    • 0.3.1 Notable Measurements for Error – Binary Variables
    • 0.3.2 Notable Measurements for Error – Continuous Variables
  • 0.4 Receiver Operating Characteristic Curves
  • 0.5 Cross Validation
    • 0.5.1 Random Subsampling
    • 0.5.2 K-Fold
    • 0.5.3 Leave One Out
  • 0.6 caret Package (tutorial)
    • 0.6.1 Data Slicing
    • 0.6.2 Training Options (tutorial)
    • 0.6.3 Plotting Predictors (tutorial)
    • 0.6.4 Preprocessing (tutorial)
  • 0.7 Covariate Creation/Feature Extraction
    • 0.7.1 Creating Dummy Variables
    • 0.7.2 Removing Zero Covariates
    • 0.7.3 Creating Splines (Polynomial Functions)
    • 0.7.4 Multicore Parallel Processing
  • 0.8 Preprocessing with Principal Component Analysis (PCA)
    • 0.8.1 prcomp Function
    • 0.8.2 caret Package
  • 0.9 Predicting with Regression
    • 0.9.1 R Commands and Examples
  • 0.10 Prediction with Trees
    • 0.10.1 Process
    • 0.10.2 Measures of Impurity (Reference)
    • 0.10.3 Constructing Trees with caret Package
  • 0.11 Bagging
    • 0.11.1 Bagging Algorithms
  • 0.12 Random Forest
    • 0.12.1 R Commands and Examples
  • 0.13 Boosting
    • 0.13.1 R Commands and Examples
  • 0.14 Model Based Prediction
    • 0.14.1 Linear Discriminant Analysis
    • 0.14.2 Naive Bayes
    • 0.14.3 Compare Results for LDA and Naive Bayes
  • 0.15 Model Selection
    • 0.15.1 Example: Training vs Test Error for Combination of Predictors
    • 0.15.2 Split Samples
    • 0.15.3 Decompose Expected Prediction Error
    • 0.15.4 Hard Thresholding
    • 0.15.5 Regularized Regression Concept (Resource)
    • 0.15.6 Regularized Regression - Ridge Regression
    • 0.15.7 Regularized Regression - LASSO Regression
  • 0.16 Combining Predictors
    • 0.16.1 Example - Majority Vote
    • 0.16.2 Example - Model Ensembling
  • 0.17 Forecasting
    • 0.17.1 R Commands and Examples
  • 0.18 Unsupervised Prediction
    • 0.18.1 R Commands and Examples


0.1 Prediction

  • process for prediction = population \(\rightarrow\) probability and sampling to pick set of data \(\rightarrow\) split into training and test set \(\rightarrow\) build prediction function \(\rightarrow\) predict for new data \(\rightarrow\) evaluate
    • Note: choosing the right dataset and knowing what the specific question is are paramount to the success of the prediction algorithm (GoogleFlu failed to predict accurately when people’s search habits changed)

  • components of predictor = question \(\rightarrow\) input data \(\rightarrow\) features (extracting variables/characteristics) \(\rightarrow\) algorithm \(\rightarrow\) parameters (estimate) \(\rightarrow\) evaluation
  • relative order of importance = question (concrete/specific) > data (relevant) > features (properly extract) > algorithms
  • data selection
    • Note: “garbage in = garbage out” \(\rightarrow\) having the correct/relevant data will decide whether the model is successful
    • data for what you are trying to predict is most helpful
    • more data \(\rightarrow\) better models (usually)
  • feature selection
    • good features \(\rightarrow\) compress the data, retain relevant information, and are created based on expert domain knowledge
    • common mistakes \(\rightarrow\) automated feature selection (can yield good results but is likely to behave inconsistently with slightly different data), not understanding/dealing with skewed data/outliers, throwing away information unnecessarily
  • algorithm selection
    • matters less than one would expect
    • getting a sensible approach/algorithm will be the basis for a successful prediction
    • more complex algorithms can yield incremental improvements
    • ideally interpretable (simple to explain), accurate, scalable/fast (may leverage parallel computation)
  • prediction is effectively about trade-offs
    • find the correct balance between interpretability vs accuracy vs speed vs simplicity vs scalability
    • interpretability is especially important in conveying how features are used to predict outcome
    • scalability is important because for an algorithm to be of practical use, it needs to be implementable on large datasets without incurring large costs (computational complexity/time)

0.1.1 In Sample vs Out of Sample Errors

  • in sample error = error resulted from applying your prediction algorithm to the dataset you built it with
    • also known as resubstitution error
    • often optimistic (lower than the error on a new sample) as the model may be tuned to the noise in the sample
  • out of sample error = error resulted from applying your prediction algorithm to a new data set
    • also known as generalization error
    • out of sample error most important as it better evaluates how the model should perform
  • in sample error < out of sample error
    • reason is over-fitting: model too adapted/optimized for the initial dataset
      • data have two parts: signal vs noise
      • goal of predictor (should be simple/robust) = find signal
      • it is possible to design an accurate in-sample predictor, but it captures both signal and noise
      • predictor won’t perform as well on new sample
    • often times it is better to give up a little accuracy for more robustness when predicting on new data
  • example
# load data
library(kernlab); data(spam);
## Warning: package 'kernlab' was built under R version 3.4.4
# picking a small subset (10 values) from spam data set
smallSpam <- spam[sample(dim(spam)[1],size=10),]
# label spam = 2 and ham = 1
spamLabel <- (smallSpam$type=="spam")*1 + 1
# plot the capitalAve values for the dataset with colors differentiated by spam/ham (2 vs 1)
plot(smallSpam$capitalAve,col=spamLabel)

# first rule (over-fitting to capture all variation)
rule1 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.7] <- "spam"
    prediction[x < 2.40] <- "nonspam"
    prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
    prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 1 (in sample error -> no error in this case)
table(rule1(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    2
##   spam          0    3
# second rule (simple, setting a threshold)
rule2 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.8] <- "spam"
    prediction[x <= 2.8] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 2 (in sample error -> 10% in this case)
table(rule2(smallSpam$capitalAve),smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    2
##   spam          0    3
# tabulate out of sample error for algorithm 1
table(rule1(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2141  588
##   spam        647 1225
# tabulate out of sample error for algorithm 2
table(rule2(spam$capitalAve),spam$type)
##          
##           nonspam spam
##   nonspam    2224  642
##   spam        564 1171
# accuracy and total correct for algorithm 1 and 2
rbind("Rule 1" = c(Accuracy = mean(rule1(spam$capitalAve)==spam$type),
    "Total Correct" = sum(rule1(spam$capitalAve)==spam$type)),
    "Rule 2" = c(Accuracy = mean(rule2(spam$capitalAve)==spam$type),
    "Total Correct" = sum(rule2(spam$capitalAve)==spam$type)))
##         Accuracy Total Correct
## Rule 1 0.7315801          3366
## Rule 2 0.7378831          3395


0.2 Prediction Study Design

  • procedures
    1. define error rate (type I/type II)
    2. split data into:
      • training, testing, validation (optional)
    3. pick features from the training set
      • use cross-validation
    4. pick prediction function (model) on the training set
      • use cross-validation
    5. if no validation set
      • apply 1 time to test set
    6. if there is a validation set
      • apply to test set and refine
      • apply 1 time to validation
    • Note: it’s important to hold out an untouched sample to accurately estimate the out of sample error rate
  • benchmarks (e.g. setting all variables = 0) can help test the model and pinpoint what is wrong with it
  • avoid small sample sizes
    • consider binary outcomes (i.e. coin flip)
    • for n = 1, the probability of perfect classification (100% accuracy) is 50%
    • for n = 10, the probability of perfect classification (100% accuracy) is 0.1%
    • so it’s important to have bigger samples so that when you do get a high accuracy, it may actually be a significant result and not just by chance
  • example: Netflix rating prediction competition
    • split data between training and “held-out”
      • held-out included probe, quiz and test sets
      • probe is used to test the predictor built from the training dataset
      • quiz is used to realistically evaluate out of sample error rates
      • test is used to finally evaluate the validity of algorithm
    • important to not tune model to quiz set specifically
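The small-sample probabilities quoted above follow directly from the binomial: with binary outcomes, random guessing is right on any one sample with probability 0.5. A minimal base R sketch (the function name is mine):

```r
# probability that random guessing classifies all n binary outcomes correctly:
# each guess is right with probability 0.5, so p = 0.5^n
perfect_by_chance <- function(n) 0.5^n

perfect_by_chance(1)   # 0.5 -> 50% for n = 1
perfect_by_chance(10)  # ~0.001 -> roughly 0.1% for n = 10
```

This is why a high accuracy on a tiny sample is weak evidence: perfect classification of one sample happens half the time by pure chance.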

0.2.1 Sample Division Guidelines for Prediction Study Design

  • for large sample sizes
    • 60% training
    • 20% test
    • 20% validation
  • for medium sample sizes
    • 60% training
    • 40% test
    • no validation set to refine model (to ensure test set is of sufficient size)
  • for small sample sizes
    • carefully consider if there are enough samples to build a prediction algorithm
    • no test/validation sets
    • perform cross validation
    • report caveat of small sample size and highlight the fact that the prediction algorithm has never been tested for out of sample error
  • there should always be a test/validation set that is held away and should NOT be looked at when building model
    • when complete, apply the model to the held-out set only one time
  • randomly sample training and test sets
    • for data collected over time, build the training set in chunks of time
  • datasets must reflect structure of problem
    • if prediction evolves with time, split train/test sets in time chunks (known as backtesting in finance)
  • subsets of data should reflect as much diversity as possible

0.2.2 Picking the Right Data

  • use like data to predict like
  • to predict a variable/process X, use the data that’s as closely related to X as possible
  • weighting the data/variables by understanding and intuition can help to improve accuracy of prediction
  • data properties matter \(\rightarrow\) knowing how the data connects to what you are trying to measure
  • predicting on unrelated data is the most common mistake
    • if unrelated data must be used, be careful about interpreting the model as to why it works/doesn’t work


0.3 Types of Errors

  • when discussing the outcome decided on by the algorithm, Positive = identified and negative = rejected
    • True positive = correctly identified (predicted true when true)
    • False positive = incorrectly identified (predicted true when false)
    • True negative = correctly rejected (predicted false when false)
    • False negative = incorrectly rejected (predicted false when true)
  • example: medical testing
    • True positive = Sick people correctly diagnosed as sick
    • False positive = Healthy people incorrectly identified as sick
    • True negative = Healthy people correctly identified as healthy
    • False negative = Sick people incorrectly identified as healthy
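The four counts above give the standard error measures directly; a quick base R sketch with hypothetical confusion-matrix counts (the numbers are made up for illustration):

```r
# hypothetical confusion-matrix counts for a medical test
TP <- 90; FP <- 10; TN <- 85; FN <- 15
sensitivity <- TP / (TP + FN)                   # P(positive test | sick)
specificity <- TN / (TN + FP)                   # P(negative test | healthy)
accuracy    <- (TP + TN) / (TP + FP + TN + FN)  # overall fraction correct
round(c(sensitivity, specificity, accuracy), 3) # 0.857 0.895 0.875
```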

0.3.1 Notable Measurements for Error – Binary Variables

  • accuracy = fraction of correct predictions; weights false positives and false negatives equally
  • concordance = for multi-class cases, \[\kappa = \frac{accuracy - P(e)}{1 - P(e)}\] where \[P(e) = \frac{TP+FP}{total} \times \frac{TP+FN}{total} + \frac{TN+FN}{total} \times \frac{FP+TN}{total}\]

  • example
    • given that a disease has 0.1% prevalence in the population, and the test kit for the disease is 99% sensitive (most positives = disease) and 99% specific (most negatives = no disease), what is the probability of a person having the disease given a positive test result?
    • what about 10% prevalence?
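The example can be worked through with Bayes' rule; a base R sketch (the function name is mine):

```r
# P(disease | positive test) by Bayes' rule:
# sens*prev / (sens*prev + (1 - spec)*(1 - prev))
ppv <- function(sens, spec, prev) {
  sens * prev / (sens * prev + (1 - spec) * (1 - prev))
}
ppv(0.99, 0.99, 0.001)  # ~0.09: at 0.1% prevalence only ~9% of positives are sick
ppv(0.99, 0.99, 0.10)   # ~0.92 at 10% prevalence
```

The low-prevalence result is the classic surprise: even a 99%-sensitive, 99%-specific test is mostly wrong about positives when the disease is rare, because the false positives from the huge healthy population swamp the true positives.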

0.3.2 Notable Measurements for Error – Continuous Variables

  • Mean squared error (MSE) = \[\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2\]
  • Root mean squared error (RMSE) = \[\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}\]
    • in the same units as the variable
    • most commonly used error measure for continuous data
    • not an effective measure when there are outliers
      • one large value may significantly raise the RMSE
  • median absolute deviation = \[median(|Prediction_i - Truth_i|)\]
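A toy comparison (the numbers are made up) showing why RMSE is sensitive to a single outlier while the median absolute deviation is not:

```r
truth <- c(10, 12, 11, 13, 12)
pred  <- c(11, 12, 10, 13, 12)     # small errors only
sqrt(mean((pred - truth)^2))       # RMSE ~0.63
median(abs(pred - truth))          # median absolute deviation = 0

pred_out <- c(11, 12, 10, 13, 52)  # one wild miss on the last value
sqrt(mean((pred_out - truth)^2))   # RMSE jumps to ~17.9
median(abs(pred_out - truth))      # still only 1
```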


0.4 Receiver Operating Characteristic Curves

  • ROC curves are commonly used to measure the quality of a prediction algorithm
  • predictions for binary classification are often quantitative (e.g. a probability, or a score on a scale of 1 to 10)
    • different cutoffs/threshold of classification (> 0.8 \(\rightarrow\) one outcome) yield different results/predictions
    • Receiver Operating Characteristic curves are generated to compare the different outcomes

  • ROC Curves
    • x-axis = 1 - specificity (or, probability of false positive)
    • y-axis = sensitivity (or, probability of true positive)
    • each point plotted = one cutoff and the resulting (sensitivity, specificity) combination
    • area under the curve (AUC) = quantifies whether the prediction model is viable or not
      • higher area \(\rightarrow\) better predictor
      • area = 0.5 \(\rightarrow\) effectively random guessing (diagonal line in the ROC curve)
      • area = 1 \(\rightarrow\) perfect classifier
      • area = 0.8 \(\rightarrow\) considered good for a prediction algorithm

  • example
    • each point on the graph corresponds with a specificity and sensitivity
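A minimal base R sketch of how an ROC curve is traced (simulated scores; variable names are mine): sweep the cutoff across the predicted scores, record sensitivity and 1 - specificity at each cutoff, then integrate with the trapezoid rule for the AUC.

```r
set.seed(1)
score <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # positives score higher
truth <- c(rep(1, 50), rep(0, 50))
cuts  <- sort(unique(score), decreasing = TRUE)       # every possible cutoff
tpr <- sapply(cuts, function(ct) mean(score[truth == 1] >= ct))  # sensitivity
fpr <- sapply(cuts, function(ct) mean(score[truth == 0] >= ct))  # 1 - specificity
# trapezoid-rule area under the (fpr, tpr) curve; 0.5 = guessing, 1 = perfect
auc <- sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1))) / 2)
auc
```

With only one standard deviation separating the classes, the AUC lands well above 0.5 but well below 1, matching the "good but imperfect predictor" region described above.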


0.5 Cross Validation

  • procedures
    1. split training set into sub-training/test sets
    2. build model on sub-training set
    3. evaluate on sub-test set
    4. repeat and average estimated errors
  • result
    • we are able to fit/test various different models with different variables included to find the best one on the cross-validated test sets
    • we are able to test out different types of prediction algorithms to use and pick the best performing one
    • we are able to choose the parameters in prediction function and estimate their values
    • Note: original test set completely untouched, so when final prediction algorithm is applied, the result will be an unbiased measurement of the out of sample accuracy of the model
  • approaches
    • random subsampling
    • K-fold
    • leave one out
  • considerations
    • for time series data, the data must be used in “chunks”
      • one time period may depend on all previous time periods (so random samples should not be taken)
    • if you cross-validate to pick predictors, the estimated out of sample error rate may be optimistic, so the error should still be measured on independent data
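The four-step procedure above can be sketched in base R without caret (simulated data; variable names are mine):

```r
set.seed(1)
x <- rnorm(100); y <- 2 * x + rnorm(100)    # simulated training data
k <- 5
fold <- sample(rep(1:k, length.out = 100))  # randomly assign each point a fold
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, subset = fold != i)      # build model on sub-training set
  pred <- predict(fit, data.frame(x = x[fold == i]))
  mean((y[fold == i] - pred)^2)             # evaluate on held-out sub-test set
})
mean(cv_mse)  # averaged estimated error, close to the noise variance of 1
```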

0.5.1 Random Subsampling

  • a randomly sampled test set is subsetted out from the original training set
  • the predictor is built on the remaining training data and applied to the test set
  • the above are three random subsamplings from the same training set
  • considerations
    • must be done without replacement
    • random sampling with replacement = bootstrap
      • underestimates the error, since samples can be selected more than once: if a repeated sample is predicted correctly once, it is counted correct each time it appears
      • can be corrected with the 0.632 bootstrap, but the correction is complicated

0.5.2 K-Fold

  • break training set into \(K\) subsets (above is a 3-fold cross validation)
  • for each subset, build the model/predictor on the remaining training data and apply it to the test subset
  • repeat \(K\) times with different training and test subsets and average the findings
  • considerations
    • larger k = less bias, more variance
    • smaller k = more bias, less variance

0.5.3 Leave One Out

  • leave out exactly one sample and build predictor on the rest of training data
  • predict value for the left out sample
  • repeat for each sample
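Leave one out is the \(k = n\) extreme of K-fold; a minimal base R sketch (simulated data; variable names are mine):

```r
set.seed(1)
x <- rnorm(30); y <- 2 * x + rnorm(30)  # simulated training data
loo_sqerr <- sapply(seq_along(y), function(i) {
  fit <- lm(y ~ x, subset = -i)                   # build on all but sample i
  (y[i] - predict(fit, data.frame(x = x[i])))^2   # predict the left-out value
})
mean(loo_sqerr)  # leave-one-out estimate of the out of sample MSE
```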


0.6 caret Package (tutorial)

  • core functionality
    • preprocessing/cleaning data \(\rightarrow\) preProcess()
    • cross validation/data splitting \(\rightarrow\) createDataPartition(), createResample(), createTimeSlices()
    • train algorithms on training data and apply to test sets \(\rightarrow\) train(), predict()
    • model comparison (evaluate the accuracy of model on new data) \(\rightarrow\) confusionMatrix()
  • machine learning algorithms in caret package
    • linear discriminant analysis
    • regression
    • naive Bayes
    • support vector machines
    • classification and regression trees
    • random forests
    • boosting
    • many others
  • caret provides uniform framework to build/predict using different models
    • create objects of different classes for different algorithms, and caret package allows algorithms to be run the same way through predict() function

0.6.1 Data Slicing

  • createDataPartition(y=data$var, times=1, p=0.75, list=FALSE) \(\rightarrow\) creates data partitions using given variable
    • y=data$var = specifies what outcome/variable to split the data on
    • times=1 = specifies number of partitions to create (number of data splitting performed)
    • p=0.75 = percent of data that will be for training the model
    • list=FALSE = returns a matrix of indices corresponding to p% of the data (training set)
      • Note: matrix is easier to subset the data with, so list = FALSE is generally what is used
      • list=TRUE = returns a list of indices corresponding to p% of the data (training set)
    • the function effectively returns a list of indexes of the training set which can then be leveraged to subset the data
      • training<-data[inTrain, ] = subsets the data to training set only
      • testing<-data[-inTrain, ] = the rest of the data set can then be stored as the test set
    • example
# load packages and data
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Warning: package 'lattice' was built under R version 3.4.4
## Warning: package 'ggplot2' was built under R version 3.4.4
# create training set indexes with 75% of data
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
# subset spam data to training
training <- spam[inTrain,]
# subset spam data (the rest) to test
testing <- spam[-inTrain,]
# dimension of original and training dataset
rbind("original dataset" = dim(spam),"training set" = dim(training))
##                  [,1] [,2]
## original dataset 4601   58
## training set     3451   58
  • createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE) = slices the data into \(k\) folds for cross validation and returns \(k\) lists of indices
    • y=data$var = specifies what outcome/variable to split the data on
    • k=10 = specifies number of folds to create (See K Fold Cross Validation)
      • each training set contains approximately \(\frac{k-1}{k}\) of the data (in this case 90%)
      • each test set contains approximately \(\frac{1}{k}\) of the data (in this case 10%)
    • list=TRUE = returns k list of indices that corresponds to the cross-validated sets
      • Note: the returned list conveniently splits the data into k datasets/vectors of indices, so list=TRUE is generally what is used
      • when the returned object is a list (called folds in this case), you can use folds[[1]][1:10] to access different elements from that list
      • list=FALSE = returns a vector indicating which of the k folds each data point belongs to (i.e. 1 - 10 is assigned for each of the data points in this case)
        • Note: these group values correspond to the test sets for each cross validation, which means everything else besides the marked points should be used for training
    • returnTrain=TRUE (only works when list=TRUE) = returns the indices of the training sets
      • returnTrain=FALSE (default value when unspecified) = returns the indices of the test sets
    • example
# create 10 folds for cross validation and return the training set indices
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)
# structure of the training set indices
str(folds)
## List of 10
##  $ Fold01: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold02: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold03: int [1:4141] 1 2 4 5 6 7 9 10 11 12 ...
##  $ Fold04: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold05: int [1:4141] 1 2 3 4 6 7 8 9 10 11 ...
##  $ Fold06: int [1:4141] 2 3 4 5 7 8 10 11 12 13 ...
##  $ Fold07: int [1:4140] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold08: int [1:4141] 1 3 5 6 7 8 9 10 11 12 ...
##  $ Fold09: int [1:4142] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Fold10: int [1:4140] 1 2 3 4 5 6 8 9 11 12 ...
# return the test set indices instead
# note: returnTrain = FALSE is unnecessary as it is the default behavior
folds.test <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=FALSE)
str(folds.test)
## List of 10
##  $ Fold01: int [1:461] 21 43 46 87 97 99 102 128 133 157 ...
##  $ Fold02: int [1:459] 1 4 6 11 14 22 34 35 40 48 ...
##  $ Fold03: int [1:461] 15 19 39 44 51 55 59 63 70 71 ...
##  $ Fold04: int [1:459] 7 29 42 52 56 62 64 67 76 91 ...
##  $ Fold05: int [1:460] 10 26 28 38 47 83 101 115 136 143 ...
##  $ Fold06: int [1:460] 9 25 31 36 37 54 61 65 74 75 ...
##  $ Fold07: int [1:460] 12 13 30 41 53 89 96 112 122 138 ...
##  $ Fold08: int [1:460] 2 5 16 33 45 49 58 66 77 82 ...
##  $ Fold09: int [1:460] 8 18 20 23 32 57 60 79 86 110 ...
##  $ Fold10: int [1:461] 3 17 24 27 72 80 90 95 104 120 ...
# return first 10 elements of the first training set
folds[[1]][1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10
  • createResample(y=data$var, times=10, list=TRUE) = create 10 resamplings from the given data with replacement
    • list=TRUE = returns a list of vectors containing the indices of each sample
      • Note: each vector is the same length as the data, and indices may repeat (sampling is with replacement)
    • times=10 = number of samples to create
# create 10 resamples
resamples <- createResample(y=spam$type,times=10,list=TRUE)
# structure of the resamples (note some samples are repeated)
str(resamples)
## List of 10
##  $ Resample01: int [1:4601] 1 2 4 5 6 9 9 10 11 13 ...
##  $ Resample02: int [1:4601] 1 3 4 4 4 5 5 6 6 7 ...
##  $ Resample03: int [1:4601] 1 2 2 4 5 5 6 7 7 7 ...
##  $ Resample04: int [1:4601] 1 2 2 3 5 6 6 7 9 10 ...
##  $ Resample05: int [1:4601] 1 1 1 4 5 5 6 6 7 9 ...
##  $ Resample06: int [1:4601] 3 3 3 3 3 4 4 4 6 7 ...
##  $ Resample07: int [1:4601] 2 3 4 4 5 6 7 8 8 8 ...
##  $ Resample08: int [1:4601] 1 1 1 4 6 7 8 10 12 12 ...
##  $ Resample09: int [1:4601] 1 2 2 4 6 6 7 7 8 12 ...
##  $ Resample10: int [1:4601] 2 2 3 5 5 5 5 6 6 7 ...
  • createTimeSlices(y=data, initialWindow=20, horizon=10) = creates training sets with specified window length and the corresponding test sets
    • initialWindow=20 = number of consecutive values in each time slice/training set (i.e. values 1 - 20)
    • horizon=10 = number of consecutive values in each predict/test set (i.e. values 21 - 30)
    • fixedWindow=FALSE = training sets always start at the first observation
      • this means that the first training set would be 1 - 20, the second will be 1 - 21, third 1 - 22, etc.
      • but the test sets are still like before (21 - 30, 22 - 31, etc.)
# create time series data
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
# name of lists
names(folds)
## [1] "train" "test"
# first training set
folds$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# first test set
folds$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30

0.6.2 Training Options (tutorial)

  • train(y ~ x, data=df, method="glm") = function to apply the machine learning algorithm to construct model from training data
# returns the arguments of the default train function
# args(train.default)
  • train function has a large set of parameters, below are the default options
    • method="rf" = default algorithm is random forest for training a given data set; caret contains a large number of algorithms
      • names(getModelInfo()) = returns all the options for method argument
      • list of models and their information can be found here
    • preProcess=NULL = set preprocess options (see Preprocessing)
    • weights=NULL = can be used to add weights to observations, useful for unbalanced distribution (a lot more of one type than another)
    • metric=ifelse(is.factor(y), "Accuracy", "RMSE") = default metric for algorithm is Accuracy for factor variables, and RMSE, or root mean squared error, for continuous variables
      • Kappa = measure of concordance (see Notable Measurements for Error – Binary Variables)
      • RSquared can also be used here as a metric, which represents R2 from regression models (only useful for linear models)
    • maximize=ifelse(metric=="RMSE", FALSE, TRUE) = the algorithm should maximize accuracy and minimize RMSE
    • trControl=trainControl() = training controls for the model, more details below
    • tuneGrid=NULL = data frame of tuning parameter combinations to evaluate
    • tuneLength=3 = number of values to try for each tuning parameter when tuneGrid is not specified
# returns the default arguments for the trainControl object
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method), 
##     10, 25), repeats = ifelse(grepl("[d_]cv$", method), 1, NA), 
##     p = 0.75, search = "grid", initialWindow = NULL, horizon = 1, 
##     fixedWindow = TRUE, skip = 0, verboseIter = FALSE, returnData = TRUE, 
##     returnResamp = "final", savePredictions = FALSE, classProbs = FALSE, 
##     summaryFunction = defaultSummary, selectionFunction = "best", 
##     preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5, 
##         freqCut = 95/5, uniqueCut = 10, cutoff = 0.9), sampling = NULL, 
##     index = NULL, indexOut = NULL, indexFinal = NULL, timingSamps = 0, 
##     predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5, 
##         alpha = 0.05, method = "gls", complete = TRUE), trim = FALSE, 
##     allowParallel = TRUE) 
## NULL
  • trainControl creates an object that sets many options for how the model will be applied to the training data
    • Note: the default values are listed below but you can use them to set the parameters to your discretion
    • method="boot" =
      • "boot" = bootstrapping (drawing with replacement)
      • "boot632" = bootstrapping with adjustment
      • "cv" = cross validation
      • "repeatedcv" = repeated cross validation
      • "LOOCV" = leave one out cross validation
    • number=ifelse(grepl("cv", method),10, 25) = number of subsamples to take
      • number=10 = default for any kind of cross validation
      • number=25 = default for bootstrapping
      • Note: number should be increased when fine-tuning a model with a large number of parameters
    • repeats=ifelse(grepl("[d_]cv$", method), 1, NA) = number of times to repeat the subsampling
      • repeats=1 = default for repeated cross validation
      • not applicable (NA) for other methods such as bootstrapping (see the printed defaults above)
    • p=0.75 = default percentage of data to create training sets
    • initialWindow=NULL, horizon=1, fixedWindow=TRUE = parameters for time series data
    • verboseIter=FALSE = print the training logs
    • returnData=TRUE = save the training data in the model object
    • returnResamp="final" = how much of the resampled performance measures to save ("final", "all", or "none")
    • savePredictions=FALSE = save the predictions for each resample
    • classProbs=FALSE = return classification probabilities along with the predictions
    • summaryFunction=defaultSummary = function used to compute performance summaries for the model
    • preProcOptions=list(thresh = 0.95, ICAcomp = 3, k = 5) = specifies preprocessing options for the model
    • predictionBounds=rep(FALSE, 2) = specify the range of the predicted value
      • for numeric predictions, predictionBounds=c(10, NA) would mean that any value lower than 10 is treated as 10, with no upper bound
    • seeds=NA = set the seed for the operation
      • Note: setting this is important when you want to reproduce the same results when the train function is run
    • allowParallel=TRUE = sets for parallel processing/computations

0.6.3 Plotting Predictors (tutorial)

  • it is important to only plot the data in the training set
    • using the test data may lead to over-fitting (model should not be adjusted to test set)
  • goal of producing these exploratory plots = look for potential outliers, skewness, imbalances in outcome/predictors, and explainable groups of points/patterns
  • featurePlot(x=predictors, y=outcome, plot="pairs") = shortcut to plot the relationships between the predictors and outcomes
# load relevant libraries
library(ISLR); library(ggplot2);
## Warning: package 'ISLR' was built under R version 3.4.3
# load wage data
data(Wage)
# create training and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
# plot relationships between the predictors and outcome
featurePlot(x=training[,c("age","education","jobclass")], y = training$wage,plot="pairs")

  • qplot(age, wage, color=education, data=training) = can be used to separate data by a factor variable (by coloring the points differently)
    • geom_smooth(method = "lm") = adds a regression line to the plots
    • geom=c("boxplot", "jitter") = specifies what kind of plot to produce, in this case both the boxplot and the point cloud
# qplot plus linear regression lines
qplot(age,wage,colour=education,data=training)+geom_smooth(method='lm',formula=y~x)

  • cut2(variable, g=3) = creates a new factor variable by cutting the specified variable into n groups (3 in this case) based on percentiles
    • Note: cut2 function is part of the Hmisc package, so library(Hmisc) must be run first
    • this variable can then be used to tabulate/plot the data
  • grid.arrange(p1, p2, ncol=2) = gridExtra function to print multiple graphs on the same plot
    • Note: grid.arrange function is part of the gridExtra package, so library(gridExtra) must be run first
# load Hmisc and gridExtra packages
library(Hmisc);library(gridExtra);
## Warning: package 'Hmisc' was built under R version 3.4.3
## Warning: package 'survival' was built under R version 3.4.4
## Warning: package 'Formula' was built under R version 3.4.4
## Warning: package 'gridExtra' was built under R version 3.4.2
# cut the wage variable into 3 groups
cutWage <- cut2(training$wage,g=3)
# plot the boxplot
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
      geom=c("boxplot"))
# plot boxplot and point clusters
p2 <- qplot(cutWage,age, data=training,fill=cutWage,
      geom=c("boxplot","jitter"))
# plot the two graphs side by side
grid.arrange(p1,p2,ncol=2)

  • table(cutVariable, data$var2) = tabulates the cut factor variable vs another variable in the dataset (i.e. builds a contingency table using cross-classifying factors)
  • prop.table(table, margin=1) = converts a table to a proportion table
    • margin=1 = calculate the proportions based on the rows
    • margin=2 = calculate the proportions based on the columns
# tabulate the cutWage and jobclass variables
t <- table(cutWage,training$jobclass)
# print table
t
##              
## cutWage       1. Industrial 2. Information
##   [ 20.9, 93)           442            271
##   [ 93.0,119)           364            349
##   [118.9,318]           274            402
# convert to proportion table based on the rows
prop.table(t,1)
##              
## cutWage       1. Industrial 2. Information
##   [ 20.9, 93)     0.6199158      0.3800842
##   [ 93.0,119)     0.5105189      0.4894811
##   [118.9,318]     0.4053254      0.5946746
  • qplot(var1, color=var2, data=training, geom="density") = produces density plot for the given numeric and factor variables
    • effectively smoothed out histograms
    • provides for easy overlaying of groups of data
      • break different variables up by group and see how outcomes change between groups
# produce density plot
qplot(wage,colour=education,data=training,geom="density")

0.6.4 Preprocessing (tutorial)

  • some predictors may have strange distributions (i.e. skewed) and may need to be transformed to be more useful for prediction algorithm
    • particularly true for model-based algorithms \(\rightarrow\) naive Bayes, linear discriminant analysis, linear regression
  • centering = subtracting the observations of a particular variable by its mean
  • scaling = dividing the observations of a particular variable by its standard deviation
  • normalizing = centering and scaling the variable \(\rightarrow\) effectively converting each observation to the number of standard deviations away from the mean
    • the distribution of the normalized variable will have a mean of 0 and standard deviation of 1
    • Note: normalizing data can help remove bias and high variability, but may not be applicable in all cases
  • Note: if a predictor/variable is standardized when training the model, the same transformations must be performed on the test set with the mean and standard deviation of the train variables
    • this means that the mean and standard deviation of the normalized test variable will NOT be 0 and 1, respectively, but will be close
    • transformations will likely be imperfect, but the test/train sets must be processed the same way
  • train(y~x, data=training, preProcess=c("center", "scale")) = preprocessing can be directly specified in the train function
    • preProcess=c("center", "scale") = normalize all predictors before constructing model
  • preProcess(trainingData, method=c("center", "scale")) = function in the caret package to standardize data
    • you can store the result of the preProcess function as an object and apply it to the train and test sets using the predict function
library(kernlab)
# load spam data
data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preProcess object for all predictors ("-58" because 58th = outcome)
preObj <- preProcess(training[,-58],method=c("center","scale"))
# normalize training set
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# normalize test set using training parameters
testCapAveS <- predict(preObj,testing[,-58])$capitalAve
# compare results for capitalAve variable
rbind(train = c(mean = mean(trainCapAveS), std = sd(trainCapAveS)),
    test = c(mean(testCapAveS), sd(testCapAveS)))
##                mean      std
## train -1.976753e-18 1.000000
## test   2.732032e-02 1.495311
  • preProcess(data, method="BoxCox") = applies Box-Cox transformations to continuous data to help normalize the variables through maximum likelihood
    • Note: it assumes continuous values and DOES NOT deal with repeated values
    • qqnorm(processedVar) = can be used to produce the Q-Q plot which compares the theoretical quantiles with the sample quantiles to see the normality of the data
# set up BoxCox transforms
preObj <- preProcess(training[,-58],method=c("BoxCox"))
# perform preprocessing on training data
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# plot histogram and QQ Plot
# Note: the transformation definitely helped to
# normalize the data but it does not produce perfect result
par(mfrow=c(1,2)); hist(trainCapAveS); qqnorm(trainCapAveS)

  • preProcess(data, method="knnImpute") = impute/estimate the missing data using k nearest neighbors (knn) imputation
    • knnImpute = takes the k nearest neighbors of the missing value and averages their values to impute the missing observations
    • Note: most prediction algorithms are not built to handle missing data
# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
# compute differences between imputed values and true values
quantile(capAve - capAveTruth)
##            0%           25%           50%           75%          100% 
## -1.5787898095 -0.0008705889  0.0004638890  0.0010980964  0.2204953814


0.7 Covariate Creation/Feature Extraction

  • [level 1]: constructing covariates (usable metrics, features) from raw data depends heavily on the application
    • ideally we want to summarize data without too much information loss
    • examples
      • text files: frequency of words, frequency of phrases (Google ngrams), frequency of capital letters
      • images: edges, corners, blobs, ridges (computer vision feature detection)
      • webpages: number and type of images, position of elements, colors, videos (A/B Testing)
      • people: height, weight, hair color, sex, country of origin
    • generally, the more knowledge and understanding you have of the system/data, the easier it will be to extract the summarizing features
      • when in doubt, more features are safer \(\rightarrow\) less information is lost, and the features can be filtered during model construction
    • this process can be automated (i.e. PCA), but generally you have to be very careful, as one very useful feature in the training data set may not have as much effect on the test data set
    • Note: science is the key here, Google “feature extraction for [data type]” for more guidance
      • the goal is always to find the salient characteristics that are likely to be different from observation to observation
  • [level 2]: construct new covariates from extracted covariate
    • generally transformations of features you extract from raw data
    • used more for methods like regression and support vector machines (SVM), whose accuracy depends more on the distribution of input variables
    • models like classification trees don’t require as many complex covariates
    • best approach is through exploratory analysis (tables/plots)
    • should only be performed on the train dataset
    • new covariates should be added to data frames under recognizable names so they can be used later
    • preProcess() can be leveraged to handle creating new covariates
    • Note: always be careful about over-fitting

0.7.1 Creating Dummy Variables

  • convert factor variables to indicator/dummy variable \(\rightarrow\) qualitative become quantitative
  • dummyVars(outcome~var, data=training) = creates a dummy variable object that can be used through predict function to create dummy variables
    • predict(dummyObj, newdata=training) = creates appropriate columns to represent the factor variable with appropriate 0s and 1s
      • factor with 2 levels \(\rightarrow\) two indicator columns filled with 0s and 1s depending on the level
      • factor with 3 levels \(\rightarrow\) three indicator columns, one per level
      • Note: for each observation, exactly one of the columns has the value 1
# setting up data
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
# create the dummy variable columns
head(predict(dummies,newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 86582                       0                       1
## 161300                      1                       0
## 376662                      0                       1
## 377954                      0                       1
## 228963                      0                       1

0.7.2 Removing Zero Covariates

  • some variables have no variability at all (i.e. variable indicating if an email contained letters)
  • these variables are not useful when we want to construct a prediction model
  • nearZeroVar(training, saveMetrics=TRUE) = returns list of variables in training data set with information on frequency ratios, percent uniques, whether or not it has zero variance
    • freqRatio = ratio of frequencies for the most common value over second most common value
    • percentUnique = percentage of unique data points out of total number of data points
    • zeroVar = TRUE/FALSE indicating whether the predictor has only one distinct value
    • nzv = TRUE/FALSE indicating whether the predictor is a near zero variance predictor
    • Note: when nzv = TRUE, those variables should be thrown out
# print nearZeroVar table
nearZeroVar(training,saveMetrics=TRUE)
##            freqRatio percentUnique zeroVar   nzv
## year        1.115854    0.33301618   FALSE FALSE
## age         1.052632    2.80685062   FALSE FALSE
## maritl      3.178022    0.23786870   FALSE FALSE
## race        8.252381    0.19029496   FALSE FALSE
## education   1.445148    0.23786870   FALSE FALSE
## region      0.000000    0.04757374    TRUE  TRUE
## jobclass    1.042760    0.09514748   FALSE FALSE
## health      2.451560    0.09514748   FALSE FALSE
## health_ins  2.258915    0.09514748   FALSE FALSE
## logwage     1.064103   19.41008563   FALSE FALSE
## wage        1.064103   19.41008563   FALSE FALSE

0.7.3 Creating Splines (Polynomial Functions)

  • when you want to fit curves through the data, basis functions can be leveraged
  • [splines package] bs(data$var, df=3) = creates 3 new columns corresponding to the \(var\), \(var^2\), and \(var^3\) terms
  • ns() and poly() can also be used to generate polynomials
  • gam() function can also be used and it allows for smoothing of multiple variables with different values for each variable
  • Note: the same polynomial operations must be performed on the test sets using the predict function
# load splines package
library(splines)
# create polynomial function
bsBasis <- bs(training$age,df=3)
# fit the outcome on the three polynomial terms
lm1 <- lm(wage ~ bsBasis,data=training)
# plot all age vs wage data
plot(training$age,training$wage,pch=19,cex=0.5)
# plot the fitted polynomial function
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)

# predict on test values
head(predict(bsBasis,age=testing$age))
##              1          2           3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4163380 0.32117502 0.082587862
## [4,] 0.3063341 0.42415495 0.195763821
## [5,] 0.3776308 0.09063140 0.007250512
## [6,] 0.4403553 0.25969672 0.051051492

0.7.4 Multicore Parallel Processing

  • many of the algorithms in the caret package are computationally intensive
  • since most modern machines have multiple cores in their CPUs, it is often wise to enable multicore parallel processing to expedite the computations
  • doMC package is recommended to be used for caret computations (reference)
    • doMC::registerDoMC(cores=4) = registers 4 cores for R to utilize
    • the number of cores you should specify depends on the CPU on your computer (system information usually contains the number of cores)
      • it’s also possible to find the number of cores by directly searching for your CPU model number on the Internet
    • Note: once registered, you should see in your task manager/activity monitor that four “R Session” processes appear when you run your code
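As a minimal sketch of the setup described above (core counts are illustrative; doMC is an add-on package that must be installed separately):

```r
# base R: find how many cores are available on this machine
nCores <- parallel::detectCores()

# register workers for caret's internal parallel loops
# (guarded so the sketch still runs when doMC is not installed)
if (requireNamespace("doMC", quietly = TRUE)) {
  doMC::registerDoMC(cores = max(1, nCores - 1))
}
```

After registration, subsequent train() calls pick up the registered workers automatically.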


0.8 Preprocessing with Principal Component Analysis (PCA)

  • constructing a prediction model may not require every predictor
  • ideally we want to capture the most variation with the least amount of variables
    • weighted combination of predictors may improve fit
    • combination needs to capture the most information
  • PCA is suited to do this and will help reduce number of predictors as well as reduce noise (due to averaging)
    • statistical goal = find new set of multivariate variables that are uncorrelated and explain as much variance as possible
    • data compression goal = find the best matrix created with fewer variables that explains the original data
    • PCA is most useful for linear-type models (GLM, LDA)
    • generally more difficult to interpret the predictors (complex weighted sums of variables)
    • Note: outliers can be detrimental to PCA as they may represent a lot of the variation in the data
      • exploratory analysis (plots/tables) should be used to identify problems with the predictors
      • transformations with log/BoxCox may be helpful

0.8.1 prcomp Function

  • pr<-prcomp(data) = performs PCA on all variables and returns a prcomp object that contains information about standard deviations and rotations
    • pr$rotation = returns eigenvectors for the linear combinations of all variables (coefficients that variables are multiplied by to come up with the principal components) \(\rightarrow\) how the principal components are created
    • oftentimes, it is useful to add 1 to the variables and take the log transformation before performing PCA
      • helps to reduce skewness or strange distribution in data
      • log(0) = - infinity, so we add 1 to account for zero values
      • makes data more Gaussian
    • plot(pr) = plots the variances explained by the first 10 principal components (PC), i.e. a screeplot
      • can be used to find the PCs that represent the most variation
# load  spam data
data(spam)
# perform PCA on dataset
prComp <- prcomp(log10(spam[,-58]+1))
# print out the eigenvector/rotations first 5 rows and PCs
head(prComp$rotation[, 1:5], 5)
##                 PC1           PC2         PC3         PC4          PC5
## make    0.019370409  0.0427855959 -0.01631961  0.02798232 -0.014903314
## address 0.010827343  0.0408943785  0.07074906 -0.01407049  0.037237531
## all     0.040923168  0.0825569578 -0.03603222  0.04563653  0.001222215
## num3d   0.006486834 -0.0001333549  0.01234374 -0.01005991 -0.001282330
## our     0.036963221  0.0941456085 -0.01871090  0.05098463 -0.010582039
# create new variable that marks spam as 2 and nospam as 1
typeColor <- ((spam$type=="spam")*1 + 1)
# plot the first two principal components
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")

0.8.2 caret Package

  • pp<-preProcess(log10(training[,-58]+1),method="pca",pcaComp=2,thresh=0.8) = performs PCA with the preProcess function and determines the number of principal components needed to capture the specified variation
    • creates a preProcess object that can be applied using predict function
    • pcaComp=2 = specifies the number of principal components to compute (2 in this case)
    • thresh=0.8 = threshold for variation captured by principal components
      • thresh=0.95 = default value, which returns the number of principal components that are needed to capture 95% of the variation in data
  • predict(pp, training) = computes new variables for the PCs (2 in this case) for the training data set
    • the results from predict can then be used as data for the prediction model
    • Note: the same PCA must be performed on the test set
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# run model on outcome and principal components
modelFit <- train(y=training$type,x=trainPC,method="glm")
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     647   50
##    spam         66  387
##                                           
##                Accuracy : 0.8991          
##                  95% CI : (0.8803, 0.9159)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7874          
##  Mcnemar's Test P-Value : 0.1637          
##                                           
##             Sensitivity : 0.9074          
##             Specificity : 0.8856          
##          Pos Pred Value : 0.9283          
##          Neg Pred Value : 0.8543          
##              Prevalence : 0.6200          
##          Detection Rate : 0.5626          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.8965          
##                                           
##        'Positive' Class : nonspam         
## 
  • alternatively, PCA can be directly performed with the train method
    • train(outcome ~ ., method="glm", preProcess="pca", data=training) = performs PCA first on the training set and then runs the specified model
      • effectively the same procedures as above (preProcess \(\rightarrow\) predict)
  • Note: in both cases, the PCs were able to achieve 90+% accuracy
# construct model
#modelFit <- train(training$type ~ .,method="glm",preProcess="pca",data=training)
modelFit <- train(y=training$type,x=training[,-58],method="glm",preProcess="pca")

# print results of model
confusionMatrix(testing$type,predict(modelFit,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     664   33
##    spam         46  407
##                                           
##                Accuracy : 0.9313          
##                  95% CI : (0.9151, 0.9452)
##     No Information Rate : 0.6174          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8554          
##  Mcnemar's Test P-Value : 0.177           
##                                           
##             Sensitivity : 0.9352          
##             Specificity : 0.9250          
##          Pos Pred Value : 0.9527          
##          Neg Pred Value : 0.8985          
##              Prevalence : 0.6174          
##          Detection Rate : 0.5774          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.9301          
##                                           
##        'Positive' Class : nonspam         
## 


0.9 Predicting with Regression

  • prediction with regression = fitting regression model (line) to data \(\rightarrow\) multiplying each variable by coefficients to predict outcome
  • useful when the relationship between the variables can be modeled as linear
  • the model is easy to implement and the coefficients are easy to interpret
  • if the relationships are non-linear, the regression model may produce poor results/accuracy
    • Note: linear regressions are generally used in combination with other models
  • model \[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_p X_{pi} + e_i \]
    • where \(\beta_0\) is the intercept (when all variables are 0)
    • \(\beta_1, \beta_2, \ldots, \beta_p\) are the coefficients
    • \(X_{1i}, X_{2i}, \ldots, X_{pi}\) are the variables/covariates
    • \(e_i\) is the error
    • \(Y_i\) is the outcome
  • prediction \[\hat Y_i = \hat \beta_0 + \hat \beta_1 X_{1i} + \hat \beta_2 X_{2i} + \ldots + \hat \beta_p X_{pi}\]
    • where \(\hat \beta_0\) is the estimated intercept (when all variables are 0)
    • \(\hat \beta_1, \hat \beta_2, \ldots, \hat \beta_p\) are the estimated coefficients
    • \(X_{1i}, X_{2i}, \ldots, X_{pi}\) are the variables/covariates
    • \(\hat Y_i\) is the predicted outcome

0.9.1 R Commands and Examples

  • lm<-lm(y ~ x, data=train) = runs a linear model of outcome y on predictor x \(\rightarrow\) univariate regression
    • summary(lm) = returns summary of the linear regression model, which will include coefficients, standard errors, \(t\) statistics, and p values
    • lm(y ~ x1+x2+x3, data=train) = run linear model of outcome y on predictors x1, x2, and x3
    • lm(y ~ ., data=train) = run linear model of outcome y on all predictors
  • predict(lm, newdata=df) = use the constructed linear model to predict outcomes (\(\hat Y_i\)) for the new values
    • newdata data frame must have the same variables (factors must have the same levels) as the training data
    • newdata=test = predict outcomes for the test set based on linear regression model from the training
    • Note: the regression line will not be a perfect fit on the test set since it was constructed on the training set
  • RMSE can be calculated to measure the accuracy of the linear model
    • Note: \(RMSE_{test}\), which estimates the out-of-sample error, is almost always GREATER than \(RMSE_{train}\)
# load data
data(faithful)
# create train and test sets
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
# build linear model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
# print summary of linear model
summary(lm1)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18990 -0.33724  0.06439  0.32056  1.22913 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.048931   0.219126   -9.35 2.49e-16 ***
## waiting      0.077458   0.003032   25.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4801 on 135 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.8274 
## F-statistic: 652.8 on 1 and 135 DF,  p-value: < 2.2e-16
# predict eruptions for new waiting time
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
##        1 
## 4.147697
# create 1 x 2 panel plot
par(mfrow=c(1,2))
# plot train data with the regression line
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",
    ylab="Duration", main = "Train")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# plot test data with the regression line
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",
    ylab="Duration", main = "Test")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)

# Compare errors (root sum of squared errors here; divide by n inside
# sqrt for the true RMSE) on training and test sets
c(trainRMSE = sqrt(sum((lm1$fitted-trainFaith$eruptions)^2)),
    testRMSE = sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2)))
## trainRMSE  testRMSE 
##  5.578530  6.013711
  • pi<-predict(lm, newdata=test, interval="prediction") = returns 3 columns for fit (predicted value, same as before), lwr (lower bound of prediction interval), and upr (upper bound of prediction interval)
    • matlines(x, pi, type="l") = plots three lines, one for the linear fit and two for upper/lower prediction interval bounds
# calculate prediction interval
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
# plot data points (eruptions, waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
# plot fit line and prediction interval
matlines(testFaith$waiting,pred1,type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)

  • lm <- train(y ~ x, method="lm", data=train) = run linear model on the training data \(\rightarrow\) identical to lm function
    • summary(lm$finalModel) = returns summary of the linear regression model, which will include coefficients, standard errors, \(t\) statistics, and p values \(\rightarrow\) identical to summary(lm) for a lm object
    • train(y ~ ., method="lm", data=train) = run linear model on all predictors in training data
      • multiple predictors (dummy/indicator variables) are created for factor variables
    • plot(lm$finalModel) = construct 4 diagnostic plots for evaluating the model
      • Note: more information on these plots can be found at ?plot.lm
      • Residuals vs Fitted
      • Normal Q-Q
      • Scale-Location
      • Residuals vs Leverage
# create train and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# fit linear model for age jobclass and education
modFit<- train(wage ~ age + jobclass + education,method = "lm",data=training)
# store final model
finMod <- modFit$finalModel
# set up 2 x 2 panel plot
par(mfrow = c(2, 2))
# construct diagnostic plots for model
plot(finMod,pch=19,cex=0.5,col="#00000010")

  • plotting residuals by fitted values and coloring with a variable not used in the model helps spot a trend in that variable.
# plot fitted values by residuals 
qplot(finMod$fitted, finMod$residuals, color=race, data=training)

  • plotting residuals by index (i.e. row numbers) can be helpful in showing missing variables
    • plot(finMod$residuals) = plot the residuals against index (row number)
    • if there’s a trend/pattern in the residuals, it is highly likely that another variable (such as age/time) should be included.
      • residuals should not have relationship to index
# plot residual by index
plot(finMod$residuals,pch=19,cex=0.5)

  • here the residuals increase linearly with the index, and the highest residuals are concentrated in the higher indexes, so there must be a missing variable


0.10 Prediction with Trees

  • prediction with trees = iteratively split variables into groups (effectively constructing decision trees) \(\rightarrow\) produces nonlinear model
    • the classification tree uses interactions between variables \(\rightarrow\) the ultimate groups/leaves may depend on many variables
  • the result (tree) is easy to interpret, and generally performs better predictions than regression models when the relationships are non-linear
  • transformations less important \(\rightarrow\) monotone transformations (order unchanged, such as \(\log\)) will produce same splits
  • trees can be used for regression problems as well and use RMSE as measure of impurity
  • however, without proper cross-validation, the model can be over-fitted (especially with large number of variables) and results may be variable from one run to the next
    • it is also harder to estimate the uncertainty of the model
  • party, rpart, tree packages can all build trees

0.10.1 Process

  1. start with all variables in one group
  2. find the variable that best splits the outcomes into two groups
  3. divide data into two groups (leaves) based on the split performed (node)
  4. within each split, find variables to split the groups again
  5. continue this process until all groups are sufficiently small/homogeneous/“pure”
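Step 2 above can be sketched in base R using the Gini index (defined in the next subsection) as the split criterion — a toy illustration under that assumption, not how rpart is actually implemented:

```r
# Gini impurity of a vector of class labels
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# weighted impurity after splitting labels y at threshold t on numeric x
splitImpurity <- function(x, y, t) {
  left <- y[x < t]; right <- y[x >= t]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# toy data: x perfectly separates the two classes at 0.5
x <- c(0.1, 0.2, 0.3, 0.7, 0.8, 0.9)
y <- c("a", "a", "a", "b", "b", "b")

# candidate thresholds: midpoints between consecutive sorted x values
cand <- (head(sort(x), -1) + tail(sort(x), -1)) / 2
# keep the threshold that minimizes the weighted impurity
best <- cand[which.min(sapply(cand, function(t) splitImpurity(x, y, t)))]
best  # 0.5: the split that makes both leaves pure
```

A real tree builder repeats this search over every variable at every node (steps 4–5).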

0.10.2 Measures of Impurity (Reference)

\[\hat{p}_{mk} = \frac{\sum_{i \in m} \mathbb{1}(y_i = k)}{N_m}\]

  • \(\hat p_{mk}\) is the proportion of the objects in group \(m\) that take on the classification \(k\)
  • \(N_m\) is the size of the group

  • Misclassification Error \[ 1 - \hat{p}_{m~k(m)}\] where \(k(m)\) is the most common classification/group
    • 0 = perfect purity
    • 0.5 = no purity
      • Note: it is not 1 here because when \(\hat{p}_{m~k(m)} < 0.5\), i.e. there is no predominant classification for the objects, the group should be further subdivided until there is a majority
  • Gini Index \[ \sum_{k \neq k'} \hat{p}_{mk} \times \hat{p}_{mk'} = \sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk}) = 1 - \sum_{k=1}^K \hat{p}_{mk}^2\]
    • 0 = perfect purity
    • 0.5 = no purity
  • Deviance \[ -\sum_{k=1}^K \hat{p}_{mk} \log_e\hat{p}_{mk} \]
    • 0 = perfect purity
    • 1 = no purity
  • Information Gain \[ -\sum_{k=1}^K \hat{p}_{mk} \log_2\hat{p}_{mk} \]
    • 0 = perfect purity
    • 1 = no purity
  • example

# set margin and seed
par(mar=c(1,1,1,1), mfrow = c(1, 2)); set.seed(1234);
# simulate data
x = rep(1:4,each=4); y = rep(1:4,4)
# plot first scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",15),rep("red",1)),pch=19)
# plot second scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",8),rep("red",8)),pch=19)

  • left graph
    • Misclassification: \(\frac{1}{16} = 0.06\)
    • Gini: \(1 - [(\frac{1}{16})^2 + (\frac{15}{16})^2] = 0.12\)
    • Information: \(-[\frac{1}{16} \times \log_2 (\frac{1}{16})+ \frac{15}{16} \times \log_2(\frac{15}{16})] = 0.34\)
  • right graph
    • Misclassification: \(\frac{8}{16} = 0.5\)
    • Gini: \(1 - [(\frac{8}{16})^2 + (\frac{8}{16})^2] = 0.5\)
    • Information:\(-[\frac{8}{16} \times \log_2 (\frac{8}{16})+ \frac{8}{16} \times \log_2(\frac{8}{16})] = 1\)
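The hand calculations above can be checked numerically with the class proportions \(\hat{p}_{mk}\) (a quick base-R sanity check):

```r
# impurity measures as functions of the class-proportion vector p
miscl <- function(p) 1 - max(p)           # misclassification error
gini  <- function(p) 1 - sum(p^2)         # Gini index
info  <- function(p) -sum(p * log2(p))    # information (entropy, log base 2)

left  <- c(1, 15) / 16   # left graph: 1 red, 15 blue
right <- c(8, 8) / 16    # right graph: even split

round(c(miscl(left), gini(left), info(left)), 2)     # 0.06 0.12 0.34
round(c(miscl(right), gini(right), info(right)), 2)  # 0.50 0.50 1.00
```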

0.10.3 Constructing Trees with caret Package

  • tree<-train(y ~ ., data=train, method="rpart") = constructs trees based on the outcome and predictors
    • produces an rpart object, which can be used to predict new/test values
    • print(tree$finalModel) = returns text summary of all nodes/splits in the tree constructed
  • plot(tree$finalModel, uniform=TRUE) = plots the classification tree with all nodes/splits
    • [rattle package] fancyRpartPlot(tree$finalModel) = produces more readable, better formatted classification tree diagrams
    • each split will have the condition/node in bold and the leaves on the left and right sides following the “yes” or “no” indicators
      • “yes” \(\rightarrow\) go left
      • “no” \(\rightarrow\) go right
# load iris data set
data(iris)
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.65 34  1 versicolor (0.00000000 0.97058824 0.02941176) *
##     7) Petal.Width>=1.65 36  2 virginica (0.00000000 0.05555556 0.94444444) *
# plot the classification tree
rattle::fancyRpartPlot(modFit$finalModel)
## Warning: Bad 'data' field in model 'call'.
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

# predict on test values
predict(modFit,newdata=testing)
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] versicolor virginica  virginica  versicolor versicolor virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

0.11 Bagging

  • bagging = bootstrap aggregating
    • resample training data set (with replacement) and recalculate predictions
    • average the predictions together or majority vote
    • more information can be found here
  • averaging multiple complex models has similar bias as each of the models on its own, but reduced variance because of the averaging
  • most useful for non-linear models

  • example
    • loess(y ~ x, data=train, span=0.2) = fits a smooth curve to data
      • span=0.2 = controls how smooth the curve should be
# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
    # create sample from data with replacement
    ss <- sample(1:dim(ozone)[1],replace=T)
    # draw sample from the data and reorder rows based on ozone
    ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
    # fit loess function through data (similar to spline)
    loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
    # prediction from loess curve for the same values each time
    ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)

0.11.1 Bagging Algorithms

  • in the caret package, there are three options for the train function to perform bagging
    • bagEarth - Bagged MARS (documentation)
    • treebag - Bagged CART (documentation)
    • bagFDA - Bagged Flexible Discriminant Analysis (documentation)
  • alternatively, custom bag functions can be constructed (documentation)
    • bag(predictors, outcome, B=10, bagControl(fit, predict, aggregate)) = define and execute custom bagging algorithm
      • B=10 = number of bootstrap resamples to perform
      • bagControl() = controls for how the bagging should be executed
        • fit=ctreeBag$fit = the model to fit on each resample of the data
        • predict=ctreeBag$pred = how predictions should be calculated from each model
        • aggregate=ctreeBag$aggregate = how the prediction models should be combined/averaged
  • example
# load relevant package and data
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
                # custom bagging function
                bagControl = bagControl(fit = ctreeBag$fit,
                                        predict = ctreeBag$pred,
                                        aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")


0.12 Random Forest

  • random forest = extension of bagging on classification/regression trees
    • one of the most used/accurate algorithms along with boosting

  • process
    • bootstrap samples from training data (with replacement)
    • split and bootstrap variables
    • grow trees (repeat split/bootstrap) and vote/average final trees
  • drawbacks
    • algorithm can be slow (process large number of trees)
    • hard to interpret (large numbers of splits and nodes)
    • over-fitting (difficult to know which tree is causing over-fitting)
    • Note: it is extremely important to use cross validation when running random forest algorithms

0.12.1 R Commands and Examples

  • rf<-train(outcome ~ ., data=train, method="rf", prox=TRUE, ntree=500) = runs the random forest algorithm on the training data against all predictors
    • Note: the random forest algorithm bootstraps automatically by default, but it is still important to have a train/test/validation split to verify the accuracy of the model
    • prox=TRUE = the proximity measures between observations should be calculated (used in functions such as classCenter() to find center of groups)
      • rf$finalModel$prox = returns matrix of proximities
    • ntree=500 = specify number of trees that should be constructed
    • do.trace=TRUE = prints logs as the trees are being built \(\rightarrow\) useful by indicating progress to user
    • Note: randomForest() function can be used to perform random forest algorithm (syntax is the same as train) and is much faster
  • getTree(rf$finalModel, k=2) = return specific tree from random forest model
  • classCenter(predictors, outcome, proximity, nNbr) = computes the cluster centers using the nNbr nearest neighbors of the observations
    • prox = rf$finalModel$prox = proximity matrix from the random forest model
    • nNbr = number of nearest neighbors that should be used to compute cluster centers
  • predict(rf, test) = apply the random forest model to test data set
    • confusionMatrix(predictions, actualOutcome) = tabulates the predictions of the model against the truths
      • Note: this is generally done for the validation data set using the model built from training
  • example
library(randomForest)
# load data
data(iris)
# create train/test data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# apply random forest
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
# return the second tree (first 6 rows)
head(getTree(modFit$finalModel,k=2))
##   left daughter right daughter split var split point status prediction
## 1             2              3         3        2.45      1          0
## 2             0              0         0        0.00     -1          1
## 3             4              5         3        4.75      1          0
## 4             6              7         4        1.65      1          0
## 5             8              9         3        5.30      1          0
## 6             0              0         0        0.00     -1          2
# compute cluster centers
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
# convert irisP to data frame and add Species column
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
# plot data points
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
# add the cluster centers
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)

# predict outcome for test data set using the random forest model
pred <- predict(modFit,testing)
# logical value for whether or not the rf algorithm predicted correctly
testing$predRight <- pred==testing$Species
# tabulate results
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         3
##   virginica       0          0        12
# plot data points with the incorrect classification highlighted
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")


0.13 Boosting

  • boosting = one of the most widely used and accurate prediction models, along with random forest
  • boosting can be done with any set of classifiers, and a well-known approach is gradient boosting
  • a more detailed tutorial can be found here

  • process: take a group of weak predictors \(\rightarrow\) weight them and add them up \(\rightarrow\) result in a stronger predictor
    • start with a set of classifiers \(h_1, \ldots, h_k\)
      • examples: all possible trees, all possible regression models, all possible cutoffs (divide data into different parts)
    • calculate a weighted sum of classifiers as the prediction value \[f(x) = \sum_i \alpha_i h_i(x)\] where \(\alpha_i\) = coefficient/weight and \(h_i(x)\) = value of classifier
      • goal = minimize error (on training set)
      • select one \(h\) at each step (iterative)
      • calculate weights based on errors
      • up-weight missed classifications and select next \(h\)
  • example

  • we start with a space of blue + and red - symbols, and the goal is to classify all the objects correctly
  • only straight lines will be used for classification

  • from the above, we can see that a group of weak predictors (lines in this case) can be combined and weighted to become a much stronger predictor
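The iterative process described above can be sketched directly. The following is a minimal AdaBoost-style loop with decision stumps on simulated data — an illustrative sketch of the weighted-sum idea, not the implementation used by gbm; all data and parameter values here are made up.

```r
# minimal AdaBoost-style sketch: weak classifiers are decision stumps
# on a single predictor x, labels y coded as -1/+1
set.seed(1)
n <- 100
x <- runif(n)
y <- ifelse(x > 0.5, 1, -1)
flip <- sample(n, 10); y[flip] <- -y[flip]       # add some label noise

thresholds <- seq(0.05, 0.95, by = 0.05)
w <- rep(1 / n, n)                               # observation weights
alpha <- numeric(0); stumps <- list()
for (m in 1:10) {
    best <- NULL; best.err <- Inf
    # select the stump h (threshold + sign) with lowest weighted error
    for (t in thresholds) for (s in c(1, -1)) {
        pred <- s * ifelse(x > t, 1, -1)
        err <- sum(w * (pred != y))
        if (err < best.err) { best.err <- err; best <- c(t, s) }
    }
    a <- 0.5 * log((1 - best.err) / best.err)    # weight alpha for this stump
    pred <- best[2] * ifelse(x > best[1], 1, -1)
    w <- w * exp(-a * y * pred); w <- w / sum(w) # up-weight missed points
    alpha <- c(alpha, a); stumps[[m]] <- best
}
# final prediction = sign of the weighted sum f(x) = sum_i alpha_i h_i(x)
f <- rep(0, n)
for (m in seq_along(alpha)) {
    f <- f + alpha[m] * stumps[[m]][2] * ifelse(x > stumps[[m]][1], 1, -1)
}
mean(sign(f) == y)    # training accuracy of the combined classifier
```

Each stump on its own misclassifies many points; the weighted sum of ten stumps recovers most of the signal despite the injected label noise.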

0.13.1 R Commands and Examples

  • gbm <- train(outcome ~ variables, method="gbm", data=train, verbose=F) = run boosting model on the given data
    • options for method for boosting
      • gbm - boosting with trees
      • mboost - model based boosting
      • ada - statistical boosting based on additive logistic regression
      • gamBoost - boosting generalized additive models
      • Note: differences between packages include the choice of basic classification functions and combination rules
  • predict function can be used to apply the model to test data, similar to the rest of the algorithms in caret package

  • example

library(ISLR)
# load data
data(Wage)
# remove log wage variable (we are trying to predict wage)
Wage <- subset(Wage,select=-c(logwage))
# create train/test data sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# run the gbm model
modFit <- train(wage ~ ., method="gbm",data=training,verbose=FALSE)
# print model summary
print(modFit)
## Stochastic Gradient Boosting 
## 
## 2102 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      35.49016  0.3114599  24.35923
##   1                  100      34.95818  0.3222808  23.95843
##   1                  150      34.90950  0.3236008  23.97720
##   2                   50      34.91530  0.3248413  23.87687
##   2                  100      34.79451  0.3280679  23.83395
##   2                  150      34.90113  0.3251693  23.94022
##   3                   50      34.79871  0.3280576  23.76077
##   3                  100      34.97846  0.3223291  23.92947
##   3                  150      35.19116  0.3163043  24.10556
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.


0.14 Model Based Prediction

  • model based prediction = assumes the data follow a probabilistic model/distribution and uses Bayes’ theorem to identify optimal classifiers/variables
    • can potentially take advantage of structure of the data
    • could help reduce computational complexity (reduce variables)
    • can be reasonably accurate on real problems
  • this approach does make additional assumptions about the data, which can lead to model failure/reduced accuracy if they are too far off
  • goal = build parameter-based model (based on probabilities) for conditional distribution \(P(Y = k~|~X = x)\), or the probability of the outcome \(Y\) is equal to a particular value \(k\) given a specific set of predictor variables \(x\)
    • Note: \(X\) is the data for the model (observations for all predictor variables), which is also known as the design matrix
  • typical approach/process
    1. start with the quantity \(P(Y = k~|~X = x)\)
    2. apply Bayes’ Theorem such that \[ P(Y = k ~|~ X=x) = \frac{P(X=x~|~Y=k)P(Y=k)}{\sum_{\ell=1}^K P(X=x ~|~Y = \ell) P(Y=\ell)}\] where the denominator sums, over all possible outcomes \(\ell\) of \(Y\), the probability of observing the predictor values \(x\) for that outcome
    3. assume the term \(P(X=x~|~Y=k)\) in the numerator follows a parameter-based probability distribution, or \(f_k(x)\)
      • common choice = Gaussian distribution \[f_k(x) = \frac{1}{\sigma_k \sqrt{2 \pi}}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}\]
    4. assume the probability for the outcome \(Y\) to take on value of \(k\), or \(P(Y=k)\), is determined from the data to be some known quantity \(\pi_k\)
      • Note: \(P(Y=k)\) is known as the prior probability
    5. so the quantity \(P(Y = k~|~X = x)\) can be rewritten as \[P(Y = k ~|~ X=x) = \frac{f_k(x) \pi_k}{\sum_{\ell = 1}^K f_{\ell}(x) \pi_{\ell}}\]
    6. estimate the parameters (\(\mu_k\), \(\sigma_k^2\)) for the function \(f_k(x)\) from the data
    7. calculate \(P(Y = k~|~X = x)\) using the parameters
    8. the outcome \(Y\) is where the value of \(P(Y = k ~|~ X = x)\) is the highest
  • prediction models that leverage this approach
    • linear discriminant analysis = assumes \(f_k(x)\) is multivariate Gaussian distribution with same covariance for each predictor variables
      • effectively drawing lines through “covariate space”
    • quadratic discriminant analysis = assumes \(f_k(x)\) is multivariate Gaussian distribution with different covariance for predictor variables
      • effectively drawing curves through “covariate space”
    • normal mixture modeling = assumes more complicated covariance matrix for the predictor variables
    • naive Bayes = assumes independence between predictor variables/features for model building (covariance = 0)
      • Note: this may be an incorrect assumption but it helps to reduce computational complexity and may still produce a useful result
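The typical process above can be traced by hand for a single predictor and two classes. The sketch below (all numbers simulated purely for illustration) estimates \(\mu_k\) and \(\sigma_k\) from the data and evaluates the posterior formula from step 5:

```r
# hand-rolled Bayes classifier: one predictor, two classes,
# Gaussian class-conditional densities f_k(x)
set.seed(1)
x1 <- rnorm(50, mean = 0)          # observations from class 1
x2 <- rnorm(50, mean = 3)          # observations from class 2
pi.k  <- c(0.5, 0.5)               # prior probabilities P(Y = k)
mu    <- c(mean(x1), mean(x2))     # estimated class means
sigma <- c(sd(x1), sd(x2))         # estimated class standard deviations

# posterior P(Y = k | X = x) = f_k(x) pi_k / sum_l f_l(x) pi_l
posterior <- function(x) {
    f <- dnorm(x, mean = mu, sd = sigma)
    f * pi.k / sum(f * pi.k)
}
posterior(0.2)    # strongly favors class 1
posterior(2.8)    # strongly favors class 2
```

The predicted class is simply the \(k\) with the largest posterior, which is exactly what LDA, QDA, and naive Bayes do with more elaborate choices of \(f_k\).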

0.14.1 Linear Discriminant Analysis

  • to compare the probability for outcome \(Y = k\) versus the probability for outcome \(Y = j\), we can look at the ratio \[\frac{P(Y=k~|~X=x)}{P(Y=j~|~X=x)}\]
  • take the log of the ratio and apply Bayes’ Theorem, we get \[\log \frac{P(Y = k~|~X=x)}{P(Y = j~|~X=x)} = \log \frac{f_k(x)}{f_j(x)} + \log \frac{\pi_k}{\pi_j}\] which is effectively the log ratio of probability density functions plus the log ratio of prior probabilities
    • Note: \(\log\) is a monotone transformation, so taking the \(\log\) of the ratio does not change which outcome wins, since \(\log(ratio)\) increases exactly when the ratio does
  • if we substitute \(f_k(x)\) and \(f_j(x)\) with Gaussian probability density functions \[f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] the ratio simplifies to \[\log \frac{P(Y = k~|~X=x)}{P(Y = j~|~X=x)} = \log \frac{\pi_k}{\pi_j} - \frac{1}{2}(\mu_k + \mu_j)^T \Sigma^{-1}(\mu_k - \mu_j) + x^T \Sigma^{-1} (\mu_k - \mu_j)\] where \(\Sigma\) = covariance matrix of the predictor variables, \(x\) = set of predictor variables, and \(\mu_k\) / \(\mu_j\) = means of classes \(k\) and \(j\) respectively
  • as annotated above, the log-ratio is effectively an equation of a line for a set of predictor variables \(x\) \(\rightarrow\) the first two terms are constants and the last term is in the form of \(X \beta\)
    • Note: the lines are also known as decision boundaries
  • therefore, we can classify values based on which side of the line the value is located (\(k\) vs \(j\))
  • discriminant functions are used to determine the value of \(k\); they take the form \[\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log(\pi_k)\]
    • plugging in the set of predictor variables, \(x^T\), into the discriminant function, we can find the value of \(k\) that maximizes the function \(\delta_k(x)\)
    • the terms of the discriminant function can be estimated using maximum likelihood
  • the predicted value for the outcome is therefore \(\hat{Y}(x) = \arg\max_k \delta_k(x)\)

  • example
    • classify a group of values into 3 groups using 2 variables (\(x\), \(y\) coordinates)
    • 3 lines are drawn to split the data into 3 Gaussian distributions
      • each line splits the data into two groups \(\rightarrow\) 1 vs 2, 2 vs 3, 1 vs 3
      • each side of the line represents a region where the probability of one group (1, 2, or 3) is the highest
    • the result is represented in the following graph

  • R Commands
    • lda<-train(outcome ~ predictors, data=training, method="lda") = constructs a linear discriminant analysis model on the predictors with the provided training data
    • predict(lda, test) = applies the LDA model to test data and return the prediction results in data frame
    • example: caret package
# load data
data(iris)
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# run the linear discriminant analysis on training data
lda <- train(Species ~ .,data=training,method="lda")
# predict test outcomes using LDA model
pred.lda <- predict(lda,testing)
# print results
pred.lda
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  versicolor virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

0.14.2 Naive Bayes

  • for predictors \(X_1,\ldots,X_m\), we want to model \(P(Y = k ~|~ X_1,\ldots,X_m)\)
  • by applying Bayes’ Theorem, we get \[P(Y = k ~|~ X_1,\ldots,X_m) = \frac{\pi_k P(X_1,\ldots,X_m~|~ Y=k)}{\sum_{\ell = 1}^K P(X_1,\ldots,X_m ~|~ Y=\ell) \pi_{\ell}}\]
  • since the denominator is just a sum (constant), we can rewrite the quantity as \[P(Y = k ~|~ X_1,\ldots,X_m) \propto \pi_k P(X_1,\ldots,X_m~|~ Y=k)\] or the probability is proportional to the numerator
    • Note: since the denominator does not depend on \(k\), maximizing the numerator over \(k\) is the same as maximizing the full posterior
  • \(\pi_k P(X_1,\ldots,X_m~|~ Y=k)\) can be rewritten as \[\begin{aligned} \pi_k P(X_1,\ldots,X_m~|~ Y=k) & = \pi_k P(X_1 ~|~ Y = k)P(X_2,\ldots,X_m ~|~ X_1,Y=k) \\ & = \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ X_1, Y=k) P(X_3,\ldots,X_m ~|~ X_1,X_2, Y=k) \\ & = \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ X_1, Y=k)\ldots P(X_m~|~X_1\ldots,X_{m-1},Y=k) \\ \end{aligned}\] where each variable has its own probability term that depends on all the terms before it
    • this is effectively indicating that each of the predictors may be dependent on other predictors
  • however, if we assume that all predictor variables are independent of each other, the quantity simplifies to \[ \pi_k P(X_1,\ldots,X_m~|~ Y=k) \approx \pi_k P(X_1 ~|~ Y = k) P(X_2 ~|~ Y = k)\ldots P(X_m ~|~Y=k)\] which is the product of the prior probability for \(k\) and the marginal probabilities of each variable \(X_1,\ldots,X_m\) given that \(Y = k\)
    • Note: the assumption is naive in that the predictors are unlikely to be completely independent of each other, but this model still produces useful results, particularly with a large number of binary/categorical variables
      • text and document classification usually involve large quantities of binary and categorical features
  • R Commands
    • nb <- train(outcome ~ predictors, data=training, method="nb") = constructs a naive Bayes model on the predictors with the provided training data
    • predict(nb, test) = applies the naive Bayes model to test data and return the prediction results in data frame
    • example: caret package
# using the same data from iris, run naive Bayes on training data
nb <- train(Species ~ ., data=training,method="nb")
# predict test outcomes using naive Bayes model
pred.nb <- predict(nb,testing)
# print results
pred.nb
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  versicolor versicolor virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

0.14.3 Compare Results for LDA and Naive Bayes

  • linear discriminant analysis and naive Bayes generally produce similar results for small data sets
  • for our example data from iris data set, we can compare the prediction the results from the two models
# tabulate the prediction results from LDA and naive Bayes
table(pred.lda,pred.nb)
##             pred.nb
## pred.lda     setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         16         0
##   virginica       0          1        13
# create logical variable that returns TRUE for when predictions from the two models match
equalPredictions <- (pred.lda==pred.nb)
# plot the comparison
qplot(Petal.Width,Sepal.Width,colour=equalPredictions,data=testing)

  • as we can see from above, only one data point, which is located in between the two classes, is predicted differently by the two models


0.15 Model Selection

  • the general behavior of the errors of the training and test sets are as follows
    • as the number of predictors used increases (or model complexity), the error for the prediction model on training set always decreases
    • the error for the prediction model on test set decreases first and then increases as number of predictors used approaches the total number of predictors available

  • this is expected, since as more predictors are used, the model is more likely to overfit the training data
  • goal in selecting models = avoid overfitting on training data and minimize error on test data
  • approaches
    • split samples
    • decompose expected prediction error
    • hard thresholding for high-dimensional data
    • regularization for regression
      • ridge regression
      • lasso regression
  • problems
    • time/computational complexity limitations
    • high-dimensional data

0.15.1 Example: Training vs Test Error for Combination of Predictors

  • Note: the code for this example comes from Hector Corrada Bravo’s Practical Machine Learning Course
  • to demonstrate the behavior of training and test errors, the prostate dataset from Elements of Statistical Learning is used
  • all combinations of predictors are used to produce prediction models, and the residual sum of squares (RSS) is calculated for all models on both the training and test sets
library(ElemStatLearn)
# load data and set seed
data(prostate); set.seed(1)
# define outcome y and predictors x
covnames <- names(prostate[-(9:10)])
y <- prostate$lpsa; x <- prostate[,covnames]
# create test set predictors and outcomes
train.ind <- sample(nrow(prostate), ceiling(nrow(prostate)/2))
y.test <- prostate$lpsa[-train.ind]; x.test <- x[-train.ind,]
# create training set predictors and outcomes
y <- prostate$lpsa[train.ind]; x <- x[train.ind,]
# p = number of predictors
p <- length(covnames)
# initialize the list of residual sum squares
rss <- list()
# loop through each combination of predictors and build models
for (i in 1:p) {
    # compute matrix for p choose i predictors for i = 1...p (creates i x p matrix)
    Index <- combn(p,i)
    # calculate residual sum squares of each combination of predictors
    rss[[i]] <- apply(Index, 2, function(is) {
        # take each combination (or column of Index matrix) and create formula for regression
        form <- as.formula(paste("y~", paste(covnames[is], collapse="+"), sep=""))
        # run linear regression with combination of predictors on training data
        isfit <- lm(form, data=x)
        # predict outcome for all training data points
        yhat <- predict(isfit)
        # calculate residual sum squares for predictions on training data
        train.rss <- sum((y - yhat)^2)
        # predict outcome for all test data points
        yhat <- predict(isfit, newdata=x.test)
        # calculate residual sum squares for predictions on test data
        test.rss <- sum((y.test - yhat)^2)
        # store each pair of training and test residual sum squares as a list
        c(train.rss, test.rss)
    })
}
# set up plot with labels, title, and proper x and y limits
plot(1:p, 1:p, type="n", ylim=range(unlist(rss)), xlim=c(0,p),
    xlab="Number of Predictors", ylab="Residual Sum of Squares",
    main="Prostate Cancer Data - Training vs Test RSS")
# add data points for training and test residual sum squares
for (i in 1:p) {
    # plot training residual sum squares in blue
    points(rep(i, ncol(rss[[i]])), rss[[i]][1, ], col="blue", cex = 0.5)
    # plot test residual sum squares in red
    points(rep(i, ncol(rss[[i]])), rss[[i]][2, ], col="red", cex = 0.5)
}
# find the minimum training RSS for each combination of predictors
minrss <- sapply(rss, function(x) min(x[1,]))
# plot line through the minimum training RSS data points in blue
lines((1:p), minrss, col="blue", lwd=1.7)
# find the minimum test RSS for each combination of predictors
minrss <- sapply(rss, function(x) min(x[2,]))
# plot line through the minimum test RSS data points in blue
lines((1:p), minrss, col="red", lwd=1.7)
# add legend
legend("topright", c("Train", "Test"), col=c("blue", "red"), pch=1)

  • from the above, we can clearly see that the test RSS approaches its minimum at around 3 predictors and increases slightly as more predictors are used

0.15.2 Split Samples

  • the best method to pick predictors/models is to split the given data into separate training/validation/test sets
  • process
    1. divide data into training/test/validation sets (60 - 20 - 20 split)
    2. train all competing models on the training data
    3. apply the models on validation data and choose the best performing model
    4. re-split data into training/test/validation sets and repeat steps 1 to 3
    5. apply the overall best performing model on test set to appropriately assess performance on new data
  • common problems
    • limited data = if not enough data is available, it may not be possible to produce a good model fit after splitting the data into 3 sets
    • computational complexity = modeling with all subsets of models can be extremely taxing in terms of computations, especially when a large number of predictors are available
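The 60/20/20 split in step 1 can be sketched in base R; the snippet below uses the built-in iris data purely as a stand-in for any data set:

```r
# sketch of a 60/20/20 train/validation/test split using base R
set.seed(1)
n <- nrow(iris)
idx <- sample(n)                  # shuffle the row indices
n.train <- round(0.6 * n)         # 60% for training
n.val   <- round(0.2 * n)         # 20% for validation, the rest for testing
train      <- iris[idx[1:n.train], ]
validation <- iris[idx[(n.train + 1):(n.train + n.val)], ]
testing    <- iris[idx[(n.train + n.val + 1):n], ]
c(nrow(train), nrow(validation), nrow(testing))   # 90 30 30
```

The same split can also be produced with two nested calls to createDataPartition from the caret package, which additionally stratifies on the outcome.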

0.15.3 Decompose Expected Prediction Error

  • the outcome \(Y_i\) can be modeled by \[Y_i = f(X_i) + \epsilon_i\] where \(\epsilon_i\) = error term
  • the expected prediction error is defined as \[EPE(\lambda) = E\left[\left(Y - \hat{f}_{\lambda}(X)\right)^2\right]\] where \(\lambda\) = specific set of tuning parameters
  • estimates from the model constructed with training data can be denoted as \(\hat{f}_{\lambda}(x^*)\) where \(X = x^*\) is the new data point that we would like to predict at
  • the expected prediction error is as follows \[\begin{aligned} E\left[\left(Y - \hat{f}_{\lambda}(x^*)\right)^2\right] & = \sigma^2 + \left(E[\hat{f}_{\lambda}(x^*)] - f(x^*)\right)^2 + E\left[\hat{f}_{\lambda}(x^*) - E[\hat{f}_{\lambda}(x^*)]\right]^2\\ & = \mbox{Irreducible Error} + \mbox{Bias}^2 + \mbox{Variance}\\ \end{aligned} \]
    • goal of prediction model = minimize overall expected prediction error
    • irreducible error = noise inherent to the data collection process \(\rightarrow\) cannot be reduced
    • bias/variance = can be traded in order to find optimal model (least error)
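The decomposition can be verified numerically with a small simulation. The sketch below uses a deliberately biased estimator of \(f(x^*)\) (all values made up for illustration) and checks that the simulated expected prediction error matches irreducible error + bias² + variance:

```r
# numerical check of EPE = irreducible error + bias^2 + variance
set.seed(1)
f.true <- 2        # the true value f(x*) at the prediction point
sigma  <- 1        # standard deviation of the noise epsilon
nsim   <- 1e5
# a deliberately biased estimator: 0.8 * (mean of 5 noisy observations)
fhat  <- replicate(nsim, 0.8 * mean(f.true + rnorm(5, sd = sigma)))
y.new <- f.true + rnorm(nsim, sd = sigma)    # fresh observations of Y
epe      <- mean((y.new - fhat)^2)           # expected prediction error
bias2    <- (mean(fhat) - f.true)^2          # squared bias
variance <- var(fhat)                        # variance of the estimator
c(epe, sigma^2 + bias2 + variance)           # the two should nearly match
```

Shrinking the estimate toward zero (the 0.8 factor) adds bias but reduces variance, which is exactly the trade that regularization methods such as ridge regression exploit.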

0.15.4 Hard Thresholding

  • if there are more predictors than observations (high-dimensional data), linear regressions will only return coefficients for some of the variables because there’s not enough data to estimate the rest of the parameters
    • conceptually, this occurs because the design matrix that the model is based on cannot be inverted
    • Note: ridge regression can help address this problem
  • hard thresholding can help estimate the coefficients/model by taking subsets of predictors and building models
  • process
    • model the outcome as \[Y_i = f(X_i) + \epsilon_i\] where \(\epsilon_i\) = error term
    • assume the prediction estimate has a linear form \[\hat{f}_{\lambda}(x) = x'\beta\] where only \(\lambda\) coefficients for the set of predictors \(x\) are nonzero
    • after setting the value of \(\lambda\), compute models using all combinations of \(\lambda\) variables to find which variables’ coefficients should be set to be zero
  • problem
    • computationally intensive
  • example
    • as we can see from the results below, some of the coefficients have values of NA
# load prostate data
data(prostate)
# take only the first 5 observations (fewer observations than predictors)
small = prostate[1:5,]
# print linear regression
lm(lpsa ~ .,data =small)
## 
## Call:
## lm(formula = lpsa ~ ., data = small)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph  
##     9.60615      0.13901     -0.79142      0.09516           NA  
##         svi          lcp      gleason        pgg45    trainTRUE  
##          NA           NA     -2.08710           NA           NA

0.15.5 Regularized Regression Concept (Resource)

  • regularized regression = fit a regression model, then shrink/adjust the large coefficients in an attempt to improve the bias/variance trade-off or help with model selection
    • when running regressions unconstrained (without specifying any criteria for coefficients), the model may be susceptible to high variance (coefficients explode \(\rightarrow\) very large values) if there are variables that are highly correlated
    • controlling/regularizing coefficients may slightly increase bias (lose a bit of prediction capability) but will reduce variance and improve the prediction error
    • however, this approach may be very demanding computationally and generally does not perform as well as random forest/boosting
  • Penalized Residual Sum of Squares (PRSS) is calculated by adding a penalty term to the prediction squared error \[PRSS(\beta) = \sum_{j=1}^n \left(Y_j - \sum_{i=1}^m \beta_i X_{ij}\right)^2 + P(\lambda; \beta)\]
    • penalty shrinks coefficients if their values become too large
    • penalty term is generally used to reduce complexity and variance for the model, while respecting the structure of the data/relationship
  • example: co-linear variables
    • given a linear model, \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\] where \(X_1\) and \(X_2\) are nearly perfectly correlated (co-linear)
    • the model can then be approximated by omitting \(X_2\), so the model becomes \[Y = \beta_0 + (\beta_1 + \beta_2)X_1 + \epsilon\]
    • with the above model, we can get a good estimate of \(Y\)
      • the estimate of \(Y\) will be biased
      • but the variance of the prediction may be reduced

0.15.6 Regularized Regression - Ridge Regression

  • the penalized residual sum of squares (PRSS) takes the form of \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]
  • this is equivalent to solving the equation \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2\] subject to constraint \(\sum_{j=1}^p \beta_j^2 \leq s\) where \(s\) is inversely proportional to \(\lambda\)
    • if the coefficients \(\beta_j\) are large in value, the term \(\sum_{j=1}^p \beta_j^2\) will cause the overall PRSS value to increase, leading to worse models
    • the presence of the term thus requires some of the coefficients to be small
  • inclusion of \(\lambda\) makes the problem non-singular even if \(X^TX\) is not invertible
    • this means that even in cases where there are more predictors than observations, the coefficients of the predictors can still be estimated
  • \(\lambda\) = tuning parameter
    • controls size of coefficients or the amount of regularization
    • as \(\lambda \rightarrow 0\), the result approaches the least square solution
    • as \(\lambda \rightarrow \infty\), all of the coefficients receive large penalties and the fitted coefficients \(\hat{\beta}_{\lambda=\infty}^{ridge}\) shrink collectively toward zero
    • \(\lambda\) should be carefully chosen through cross-validation/other techniques to find the optimal tradeoff of bias for variance
    • Note: it is important to realize that all coefficients (though they may be shrunk to very small values) will still be included in the model when applying ridge regression
  • R Commands
    • [MASS package] ridge<-lm.ridge(outcome ~ predictors, data=training, lambda=5) = perform ridge regression with given outcome and predictors using the provided \(\lambda\) value
      • Note: the predictors are centered and scaled first before the regression is run
      • lambda=5 = tuning parameter
      • ridge$xm = returns column/predictor mean from the data
      • ridge$scale = returns the scaling performed on the predictors for the ridge regression
        • Note: all the variables are divided by the biased standard deviation \(\sqrt{\sum_i (X_i - \bar X)^2 / n}\)
      • ridge$coef = returns the coefficients, \(\beta_j\), from the ridge regression (on the centered/scaled predictor scale)
      • ridge$ym = returns the mean of the outcome
    • caret package train(outcome ~ predictors, data=training, method="ridge", lambda=5) = perform ridge regression with given outcome and predictors
      • preProcess=c("center", "scale") = centers and scales the predictors before the model is built
        • Note: this is generally a good idea for building ridge regressions
      • lambda=5 = tuning parameter
    • caret package train(outcome ~ predictors, data=training, method="foba", lambda=5, k=4) = perform ridge regression with variable selection
      • lambda=5 = tuning parameter
      • k=4 = number of variables that should be retained
        • this means that length(predictors)-k variables will be eliminated
    • caret package predict(model,test) = use the model to predict on test set \(\rightarrow\) similar to all other caret algorithms
  • example: ridge coefficient paths vs \(\lambda\)
    • using the same prostate dataset, we will run ridge regressions with different values of \(\lambda\) and find the optimum \(\lambda\) value that minimizes test RSS
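A minimal sketch of this, assuming a prostate data frame (e.g. from the ElemStatLearn package) with outcome lpsa:

```r
# sketch: ridge coefficient paths over a grid of lambda values
# assumes a `prostate` data frame with outcome lpsa (e.g. ElemStatLearn package)
library(MASS)
ridge <- lm.ridge(lpsa ~ ., data = prostate, lambda = seq(0, 50, by = 0.5))
plot(ridge)      # one coefficient path per predictor vs lambda
select(ridge)    # prints lambda values suggested by HKB, L-W, and GCV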

0.15.7 Regularized Regression - LASSO Regression

  • LASSO (least absolute shrinkage and selection operator) was introduced by Tibshirani (Journal of the Royal Statistical Society 1996)
  • similar to ridge, with slightly different penalty term
  • the penalized residual sum of squares (PRSS) takes the form of \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^p |\beta_j|\]
  • this is equivalent to solving the equation \[PRSS(\beta_j) = \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j \right)^2\] subject to constraint \(\sum_{j=1}^p |\beta_j| \leq s\) where \(s\) is inversely proportional to \(\lambda\)
  • \(\lambda\) = tuning parameter
    • controls size of coefficients or the amount of regularization
    • sufficiently large values of \(\lambda\) will set some coefficients exactly to zero
      • Note: LASSO thus effectively performs model selection (chooses a subset of predictors) while shrinking the other coefficients, whereas ridge only shrinks the coefficients
  • R Commands
    • [lars package] lasso<-lars(as.matrix(x), y, type="lasso", trace=TRUE) = perform lasso regression by adding predictors one at a time (or setting some variables to 0)
      • Note: the predictors are centered and scaled first before the regression is run
      • as.matrix(x) = the predictors must be in matrix/dataframe format
      • trace=TRUE = prints progress of the lasso regression
      • lasso$lambda = return the \(\lambda\)s used for each step of the lasso regression
      • plot(lasso) = prints plot that shows the progression of the coefficients as they are set to zero one by one
      • predict.lars(lasso, test) = use the lasso model to predict on test data
        • Note: more information/documentation can be found in ?predict.lars
    • [lars package] cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE) = computes K-fold cross-validated mean squared prediction error for lasso regression
      • effectively the lars function is run K times, once with each fold held out, to estimate the cross-validated prediction error along the lasso path
      • K=10 = create 10-fold cross validation
      • trace=TRUE = prints progress of the lasso regression
    • [elasticnet package] lasso<-enet(predictors, outcome, lambda = 0) = perform elastic net regression on given predictors and outcome
      • lambda=0 = default value for \(\lambda\)
        • Note: lasso regression is a special case of elastic net regression, and forcing lambda=0 tells the function to fit a lasso regression
      • plot(lasso) = prints plot that shows the progression of the coefficients as they are set to zero one by one
      • predict.enet(lasso, test) = use the lasso model to predict on test data
    • caret package train(outcome ~ predictors, data=training, method="lasso") = perform lasso regression with given outcome and predictors
      • Note: outcome and predictors must be in the same dataframe
      • preProcess=c("center", "scale") = centers and scales the predictors before the model is built
        • Note: this is generally a good idea for building lasso regressions
    • caret package train(outcome~predictors,data=train,method="relaxo",lambda=5,phi=0.3) = perform relaxed lasso regression on given predictors and outcome
      • lambda=5 = tuning parameter
      • phi=0.3 = relaxation parameter
        • phi=1 corresponds to the regular Lasso solutions
        • phi=0 computes the OLS estimates on the set of variables selected by the Lasso
    • caret package predict(model,test) = use the model to predict on test set \(\rightarrow\) similar to all other caret algorithms
  • example: lars package
# load lars package
library(lars)
# perform lasso regression
lasso.fit <- lars(as.matrix(x), y, type="lasso", trace=TRUE)
# plot lasso regression model
plot(lasso.fit, breaks=FALSE, cex = 0.75)
# add legend
legend("topleft", covnames, pch=8, lty=1:length(covnames),
    col=1:length(covnames), cex = 0.6)

# plots the cross validation curve
lasso.cv <- cv.lars(as.matrix(x), y, K=10, type="lasso", trace=TRUE)
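With the cross-validation curve computed, the fraction along the lasso path that minimizes CV error can be pulled out and used to extract the corresponding coefficients (a sketch, continuing from lasso.fit and lasso.cv above):

```r
# pick the fraction (s) along the lasso path that minimizes CV error
best.s <- lasso.cv$index[which.min(lasso.cv$cv)]
# coefficients at that point on the path; some will be exactly zero
coef(lasso.fit, s = best.s, mode = "fraction")
```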


0.16 Combining Predictors

  • combining predictors = also known as ensemble methods in machine learning; combine classifiers by averaging/voting to improve accuracy (generally)
    • this reduces interpretability and increases computational complexity
    • boosting/bagging/random forest algorithms all use this idea, except all classifiers averaged are of the same type
  • Netflix Competition was won by a team that blended together 107 machine learning algorithms, Heritage Health Prize was also won by a combination of algorithms
    • Note: the winning algorithm for Netflix was not used because it was too computationally complex/intensive, so the trade-off between accuracy and scalability is very important
  • approach
    • combine similar classifiers using bagging/boosting/random forest
    • combine different classifiers using model stacking/ensembling
      • build an odd number of models (we need an odd number for the majority vote to avoid ties)
      • predict with each model
      • combine the predictions and predict the final outcome by majority vote
  • process: simple model ensembling
    1. build multiple models on the training data set
    2. use the models to predict on the training/test set
      • predict on training if the data was divided to only training/test sets
      • predict on test if the data was divided into training/test/validation sets
    3. combine the prediction results from each model and the true results for the training/test set into a new data frame
      • one column for each model
      • add the true outcome from the training/test set as a separate column
    4. fit a new model on the new data frame, with the true outcome as the outcome and the predictions from the various models as predictors
    5. use the combined model fit to predict results on the training/test data
      • calculate the RMSE for all models, including combined fit, to evaluate the accuracy of the different models
      • Note: the RMSE for the combined fit should generally be lower than that of the rest of the models
    6. to predict on the final test/validation set, use all the initial models to predict on the data set first to recreate a prediction data frame for the test/validation set like in step 3
      • one column for each model, no truth column this time
    7. apply the combined fit on the combined prediction data frame to get the final resultant predictions

0.16.1 Example - Majority Vote

  • suppose we have 5 independent classifiers/models
  • each has 70% accuracy
  • majority vote accuracy (mva) = probability that a majority of the 5 models classify correctly at the same time

\[\begin{aligned} \mbox{majority vote accuracy} & = p(3~correct,~2~wrong) + p(4~correct,~1~wrong) \\ &\qquad+ p(5~correct) \\ & = {5 \choose 3} (0.7)^3(0.3)^2 + {5 \choose 4} (0.7)^4(0.3)^1 + {5 \choose 5} (0.7)^5 \\ & = 10 \times(0.7)^3(0.3)^2 + 5 \times(0.7)^4(0.3)^1 + 1 \times (0.7)^5 \\ & = 83.7\% \\ \end{aligned}\]

  • with 101 classifiers, the majority vote accuracy becomes 99.9%
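These accuracies follow directly from the binomial distribution, which a quick sanity check in R confirms:

```r
# P(majority correct) for n independent classifiers, each with accuracy 0.7
sum(dbinom(3:5, size = 5, prob = 0.7))      # 5 classifiers: ~0.837
sum(dbinom(51:101, size = 101, prob = 0.7)) # 101 classifiers: exceeds 0.999
```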

0.16.2 Example - Model Ensembling

library(parallel)
library(doParallel)
# StartCPU
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
# set up data
inBuild <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
validation <- Wage[-inBuild,]; buildData <- Wage[inBuild,]
inTrain <- createDataPartition(y=buildData$wage,p=0.7, list=FALSE)
training <- buildData[inTrain,]; testing <- buildData[-inTrain,]
# train the data using both glm and random forest models
glm.fit <- train(wage ~.,method="glm",data=training)
rf.fit <- train(wage ~.,method="rf",data=training,
    trControl = trainControl(method="cv"),number=3)
# use the models to predict the results on the testing set
glm.pred.test <- predict(glm.fit,testing)
rf.pred.test <- predict(rf.fit,testing)
# combine the prediction results and the true results into new data frame
combinedTestData <- data.frame(glm.pred=glm.pred.test,
    rf.pred = rf.pred.test,wage=testing$wage)
# run a Generalized Additive Model (gam) model on the combined test data
comb.fit <- train(wage ~.,method="gam",data=combinedTestData)
# use the resultant model to predict on the test set
comb.pred.test <- predict(comb.fit, combinedTestData)
# use the glm and rf models to predict results on the validation data set
glm.pred.val <- predict(glm.fit,validation)
rf.pred.val <- predict(rf.fit,validation)
# combine the results into data frame for the comb.fit
combinedValData <- data.frame(glm.pred=glm.pred.val,rf.pred=rf.pred.val)
# run the comb.fit on the combined validation data
comb.pred.val <- predict(comb.fit,combinedValData)
# tabulate the results - test data set errors (square root of SSE)
rbind(test = c(glm = sqrt(sum((glm.pred.test-testing$wage)^2)),
        rf = sqrt(sum((rf.pred.test-testing$wage)^2)),
        combined = sqrt(sum((comb.pred.test-testing$wage)^2))),
    # validation data set errors (square root of SSE)
    validation = c(sqrt(sum((glm.pred.val-validation$wage)^2)),
        sqrt(sum((rf.pred.val-validation$wage)^2)),
        sqrt(sum((comb.pred.val-validation$wage)^2))))
##                  glm        rf  combined
## test        858.7074  893.4517  850.0222
## validation 1061.0891 1088.4277 1056.5144
# stopCPU
stopCluster(cluster)
registerDoSEQ()


0.17 Forecasting

  • forecasting = typically used with time series, predict one or more observations into the future
    • data are dependent over time so subsampling/splitting data into training/test is more complicated and must be done very carefully
  • specific patterns need to be considered for time series data (time series decomposition)
    • trend = long term increase/decrease in data
    • seasonal = patterns related to time of week/month/year/etc
    • cyclic = patterns that rise/fall periodically
  • Note: issues that arise from time series are similar to those from spatial data
    • dependency between nearby observations
    • location specific effects
  • all standard prediction models can be used but require more consideration
  • Note: more detailed tutorial can be found in Rob Hyndman’s Forecasting: principles and practice

  • considerations for interpreting results
    • unrelated time series can often seem to be correlated with each other (spurious correlations)
    • geographic analysis may exhibit similar patterns due to population distribution/concentrations
    • extrapolating too far into the future is dangerous as it can produce nonsensical results
    • dependencies over time (seasonal effects) should be examined and isolated from the trends
  • process
    • ensure the data is a time series data type
    • split data into training and test sets
      • both must have consecutive time points
    • choose forecast approach (SMA - ma vs EMA - ets, see below) and apply corresponding functions to training data
    • apply constructed forecast model to test data using forecast function
    • evaluate accuracy of forecast using accuracy function
  • approaches
    • simple moving averages = prediction will be made for a time point by averaging together values from a number of prior periods \[ Y_{t}=\frac{1}{2k+1}\sum_{j=-k}^k {y_{t+j}}\]

    • exponential smoothing/exponential moving average = weight time points that are closer to point of prediction than those that are further away \[\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_{t-1}\]
      • Note: many different methods of exponential smoothing are available, more information can be found here
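The two smoothers above can be hand-rolled in a few lines (a sketch; forecast::ma and forecast::ets used below are the production versions):

```r
# centered simple moving average: Y_t = mean of y_{t-k}, ..., y_{t+k}
sma <- function(y, k) stats::filter(y, rep(1/(2*k + 1), 2*k + 1), sides = 2)
# exponential moving average: yhat_{t+1} = alpha*y_t + (1-alpha)*yhat_t
ema <- function(y, alpha) Reduce(function(prev, yt) alpha*yt + (1 - alpha)*prev,
                                 y, accumulate = TRUE)
```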

0.17.1 R Commands and Examples

  • quantmod package can be used to pull trading/price information for publicly traded stocks
    • getSymbols("TICKER", src="yahoo", from=date, to=date) = gets the daily high/low/open/close price and volume information for the specified stock ticker
      • returns the data in a data frame under the stock ticker’s name
      • "TICKER" = ticker of the stock you are attempting to pull information for
      • src="yahoo" = get price/volume information from Yahoo Finance
        • Yahoo Finance is the default source; the Google Finance source stopped providing data in March 2018
      • from and to = from and to dates for the price/volume information
        • both arguments must be specified with date objects
      • Note: more information about how to use getSymbols can be found in the documentation ?getSymbols
    • to.monthly(GOOG) = converts stock data to monthly time series from daily data
      • the function aggregates the open/close/high/low/volume information for each day into monthly data
      • GOOG = data frame returned from getSymbols function
      • Note: ?to.period contains documentation for converting time series to OHLC (open high low close) series
    • googOpen<-Op(GOOG) = returns the opening price from the stock data frame
      • Cl(), Hi(), Lo() = returns the close, high and low price from the stock data frame
    • ts(googOpen, frequency=12) = convert data to a time series with frequency observations per time unit
      • frequency=12 = number of observations per unit time (12 in this case because there are 12 months in each year \(\rightarrow\) converts data into yearly time series)
  • decompose(ts) = decomposes time series into trend, seasonal, and irregular components by using moving averages
    • ts = time series object
  • window(ts, start=1, end=6) = subsets the time series at the specified starting and ending points
    • start and end arguments must correspond to the time unit rather than the index
      • for instance, if the ts is a yearly series (frequency = 12), start/end should correspond to the row numbers or year (each year has 12 observations corresponding to the months)
      • c(1, 7) can be used to specify the element of a particular year (in this case, July of the first year/row)
      • Note: you can use 9.5 or any decimal as values for start/end, and the closest element (June of the 9th year in this case) will be used
      • Note: end=9-0.01 can be used as a shortcut to specify “up to 9”, since end=9 will include the first element of the 9th row
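For instance, on a monthly series (a sketch, assuming a ts object named ts with frequency = 12):

```r
# sketch: subsetting a monthly time series (frequency = 12) named ts
window(ts, start = c(1, 1), end = c(1, 6))  # Jan-Jun of the first year
window(ts, start = 2, end = 9 - 0.01)       # years 2 through 8, up to (not including) year 9
```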
  • forecast package can be used for forecasting time series data
    • ma(ts, order=3) = calculates the simple moving average for the order specified
      • order=3 = order of moving average smoother, effectively the number of values that should be used to calculate the moving average
    • ets(train, model="MMM") = runs exponential smoothing model on training data
      • model = "MMM" = method used to create exponential smoothing
        • Note: more information can be found at ?ets and the corresponding model chart is here
    • forecast(ts) = performs a forecast on the specified time series and returns 5 columns: the forecast values, and the high/low bounds of the 80% and 95% prediction intervals
      • plot(forecast) = plots the forecast object, which includes the training data, forecast values for test periods, as well as the 80% and 95% prediction interval regions
    • accuracy(forecast, testData) = returns the accuracy metrics (RMSE, etc.) for the forecast model
  • the Quandl package is also used for finance-related predictions
  • example: decomposed time series
# load quantmod package
library(quantmod);
# specify to and from dates
from.dat <- as.Date("01/01/00", format="%m/%d/%y")
to.dat <- as.Date("3/2/15", format="%m/%d/%y")
# get data for AAPL from Google Finance for the specified dates
getSymbols("AAPL", src="yahoo", from = from.dat, to = to.dat) # Google Finance stopped providing data in March, 2018. 
## [1] "AAPL"
# convert the retrieved daily data to monthly data
mAAPL <- to.monthly(AAPL)
# extract the closing price and convert it to yearly time series (12 observations per year)
ts <- ts(Cl(mAAPL), frequency = 12)
# plot the decomposed parts of the time series
plot(decompose(ts),xlab="Years")

  • example: forecast
# load forecast library
library(forecast)
# find the number of rows (years)
rows <- ceiling(length(ts)/12)
# use 90% of the data to create training set
ts.train <- window(ts, start = 1, end = floor(rows*.9)-0.01)
# use the rest of data to create test set
ts.test <- window(ts, start = floor(rows*.9))
# plot the training set
plot(ts.train)
# add the moving average in red
lines(ma(ts.train,order=3),col="red")

# compute the exponential smoothing average
ets <- ets(ts.train,model="MMM")
# construct a forecasting model using the exponential smoothing function
fcast <- forecast(ets)
# plot forecast and add actual data in red
plot(fcast); lines(ts.test,col="red")

# print the accuracy of the forecast model
accuracy(fcast,ts.test)
##                        ME      RMSE       MAE        MPE     MAPE
## Training set  -0.01738163  2.698714  1.460038  -1.763296 10.52864
## Test set     -34.49892700 35.393299 34.498927 -44.319234 44.31923
##                   MASE       ACF1 Theil's U
## Training set 0.1785449 0.07841581        NA
## Test set     4.2187985 0.61247460  6.792398


0.18 Unsupervised Prediction

  • supervised classification = predicting outcome when we know what the different classifications are
    • example: predicting the type of flower (setosa, versicolor, or virginica) based on sepal width/length
  • unsupervised classification = predicting outcome when we don’t know what the different classifications are
    • example: splitting all data for sepal width/length into different groups (cluster similar data together)
  • process
    • provided that the labels for prediction/outcome are unknown, we first build clusters from observed data
      • creating clusters is not a noiseless process, and thus may introduce higher variance/error in the data
      • K-means is an example of a clustering approach
    • label the clusters
      • interpreting the clusters well (sensible vs non-sensible clusters) is incredibly challenging
    • build prediction model with the clusters as the outcome
      • all algorithms can be applied here
    • in a new data set, we predict the cluster labels
  • unsupervised prediction is effectively an exploratory technique, so the resulting clusters should be carefully interpreted
    • clusters may be highly variable depending on how the data is sampled
  • generally a good idea to create custom clustering algorithms for given data as it is crucial to define the process to identify clusters for interpretability and utility of the model
  • unsupervised prediction = basic approach to recommendation engines, in which the tastes of the existing users are clustered and applied to new users

0.18.1 R Commands and Examples

  • kmeans(data, centers=3) = can be used to perform clustering from the provided data
    • centers=3 = controls the number of clusters the algorithm should aim to divide the data into
  • cl_predict function in clue package provides similar functionality
# load required packages (caret for createDataPartition/train, ggplot2 for qplot)
library(caret); library(ggplot2); library(gridExtra)
# load iris data
data(iris)
# create training and test sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]; testing <- iris[-inTrain,]
# perform k-means clustering for the data without the Species information
# Species = what the true clusters are
kMeans1 <- kmeans(subset(training,select=-c(Species)),centers=3)
# add clusters as new variable to training set
training$clusters <- as.factor(kMeans1$cluster)
# plot clusters vs Species classification
p1 <- qplot(Petal.Width,Petal.Length,colour=clusters,data=training) +
    ggtitle("Clusters Classification")
p2 <- qplot(Petal.Width,Petal.Length,colour=Species,data=training) +
    ggtitle("Species Classification (Truth)")
grid.arrange(p1, p2, ncol = 2)

  • as we can see, there are three clear groups that emerge from the data
    • this is fairly close to the actual results from Species
    • we can compare the results from the clustering and Species classification by tabulating the values
# tabulate the results from clustering and actual species
table(kMeans1$cluster,training$Species)
##    
##     setosa versicolor virginica
##   1      0          2        26
##   2      0         33         9
##   3     35          0         0
  • with the clusters determined, the training data can be trained on all predictors with the clusters from k-means as outcome
# build classification trees using the k-means cluster
clustering <- train(clusters ~.,data=subset(training,select=-c(Species)),method="rpart")
  • we can compare the prediction results on training set vs truth
# tabulate the prediction results on training set vs truth
table(predict(clustering,training),training$Species)
##    
##     setosa versicolor virginica
##   1      0          0        23
##   2      0         35        12
##   3     35          0         0
  • similarly, we can compare the prediction results on test set vs truth
# tabulate the prediction results on test set vs truth
table(predict(clustering,testing),testing$Species)
##    
##     setosa versicolor virginica
##   1      0          0        11
##   2      0         15         4
##   3     15          0         0
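The cl_predict function mentioned above can also assign new observations to the existing k-means clusters directly (a sketch on the same iris split, reusing kMeans1 and testing from the example):

```r
# sketch: assign test observations to the fitted k-means clusters via clue
library(clue)
testClusters <- cl_predict(kMeans1, subset(testing, select = -c(Species)))
# compare cluster assignments against the true species
table(testClusters, testing$Species)
```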
THE END

Untitled

Posted on 2018-08-08 | Edited on 2016-02-15
Symbols count in article: 83k | Reading time ≈ 1:16
Developing Data Products Course Notes


Xing Su

  • Shiny (tutorial)
    • Structure of Shiny App
    • ui.R
    • ui.R Example
    • server.R
    • server.R Example
    • Distributing Shiny Application
    • Debugging
  • manipulate Package
    • Example
  • rCharts
    • Example
  • ggvis package
  • GoogleVis API
    • Example (line chart)
    • Example (merging graphs)
  • ShinyApps.io (link)
  • plot.ly (link)
    • Example
  • Structure of a Data Analysis Report
  • Slidify
    • YAML (YAML Ain’t Markup Language/Yet Another Markup Language)
    • Slides
    • Publishing
  • RStudio Presentation
    • Creating Presentation
  • Slidify vs RStudio Presenter
  • R Package
    • R Package Components
    • DESCRIPTION file
    • R Code
    • NAMESPACE file
    • Documentation
    • Building/Checking Package
    • Checklist for Creating Package
    • Example: topten function
  • R Classes and Methods
    • Objected Oriented Programming in R
    • Creating a New Class/Methods
  • Yhat (link)
    • Deploying the Model
    • Accessing the Model

Shiny (tutorial)

  • Shiny = platform for creating interactive R program embedded on web pages (made by RStudio)
  • knowledge of these helpful: HTML (web page structure), css (style), JavaScript (interactivity)
  • OpenCPU = project created by Jeroen Ooms providing an API for creating more complex R/web apps
  • install.packages("shiny"); library(shiny) = install/load shiny package
  • capabilities
    • upload or download files
    • tabbed main panels
    • editable data tables
    • dynamic UI
    • user defined inputs/outputs
    • submit button to control when calculations/tasks are processed

Structure of Shiny App

  • two scripts (in one directory) make up a Shiny project
    • ui.R - controls appearance/all style elements
      • alternatively, a www directory with an index.html file enclosed can be used instead of ui.R
        • Note: output is rendered into HTML elements based on matching their id attribute to an output slot and by specifying the requisite CSS class for the element (in this case either shiny-text-output, shiny-plot-output, or shiny-html-output)
        • it is possible to create highly customized user-interfaces using user-defined HTML/CSS/JavaScript
    • server.R - controls functions
  • runApp() executes the Shiny application
    • runApp(display.mode = 'showcase') = displays the code from ui.R and server.R and highlights what is being executed depending on the inputs
  • Note: "," must be included ONLY INBETWEEN objects/functions on the same level
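A minimal, hypothetical app illustrating the two-file structure (both fragments shown in one block for brevity):

```r
# ui.R - controls appearance
library(shiny)
shinyUI(pageWithSidebar(
    headerPanel("Hello Shiny"),
    sidebarPanel(numericInput("n", "Sample size:", value = 100)),
    mainPanel(plotOutput("hist"))
))

# server.R - controls functions
library(shiny)
shinyServer(function(input, output) {
    output$hist <- renderPlot(hist(rnorm(input$n)))
})
```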

ui.R

  • library(shiny) = first line, loads the shiny package
  • shinyUI() = shiny UI wrapper, contains sub-methods to create panels/parts/viewable object
  • pageWithSidebar() = creates page with main/side bar division
  • headerPanel("title") = specifies header of the page
  • sidebarPanel() = specifies parameters/objects in the side bar (on the left)
  • mainPanel() = specifies parameters/objects in the main panel (on the right)
  • for better control over style, use shinyUI(fluidPage()) (tutorial) <– produces responsive web pages
    • fluidRow() = creates row of content with width 12 that can be subdivided into columns
      • column(4, ...) = creates a column of width 4 within the fluid row
      • style = "CSS" = can be used as the last element of the column to specify additional style
  • absolutePanel(top=0, left=0, right=0) = used to produce floating panels on top of the page (documentation)
    • fixed = TRUE = panel will not scroll with page, which means the panel will always stay in the same position as you scroll through the page
    • draggable = TRUE = make panel movable by the user
    • top = 40 / bottom = 50 = position from the top/bottom edge of the browser window
      • top = 0, bottom = 0 = creates panel that spans the entire vertical length of window
    • left = 40 / right = 50 = position from the left/right edge of the browser window
      • top = 0, bottom = 0 = creates panel that spans the entire horizontal length of window
    • height = 30 / width = 40 = specifies the height/width of the panel
    • style = "opacity: 0.92; z-index: 100" = makes panel transparent and ensures the panel is always the top-most element
  • content objects/functions
    • Note: more HTML tags can be found here
    • Note: most of the content objects (h1, p, code, etc) can use both double and single quotes to specify values, just be careful to be consistent
    • h1/2/3/4('heading') = creates heading for the panel
    • p('paragraph') = creates regular text/paragraph
    • code('code') = renders code format on the page
    • br() = inserts line break
    • tags$hr() = inserts horizontal line
    • tags$ol()/ tags$ul() = initiates ordered/unordered list
    • div( ... , style = "CSS Code") / span( ... , style = "CSS Code") = used to add additional style to particular parts of the app
      • div should be used for a section/block, span should be used for a specific part/inline
    • withMathJax() = add this element to allow Shiny to process LaTeX
      • inline LaTeX must be wrapped like this: \\(LaTeX\\)
      • block equations are still wrapped by: $$LaTeX$$
  • inputs
    • textInput(inputId = "id", label = "textLabel") = creates a plain text input field
      • inputId = field identifier
      • label = text that appear above/before a field
    • numericInput('HTMLlabel', 'printedLabel', value = 0, min = 0, max = 10, step = 1) = create a number input field with incrementer (up/down arrows)
      • 'HTMLLabel' = identifier for the field; not printed, but used to access the value from server.R
      • 'printedLabel' = text that shows up above the input box explaining the field
      • value = default numeric value that the field should take; 0 is an example
      • min = minimum value that can be set in the field (if a smaller value is manually entered, then the value becomes the minimum specified once user clicks away from the field)
      • max = max value that can be set in the field
      • step = increments for the up/down arrows
      • more arguments can be found in ?numericInput
    • checkboxGroupInput("id2", "Checkbox",choices = c("Value 1" = "1", ...), selected = "1", inline = TRUE) = creates a series of checkboxes
      • "id2", "Checkbox" = field identifier/label
      • choices = list of checkboxes and their labels
        • format = "checkboxName" = "fieldIdentifier"
        • Note: fieldIdentifier should generally be different from checkbox to checkbox, so we can properly identify the responses
      • selected = specifies the checkboxes that should be selected by default; uses fieldIndentifier values
      • inline = whether the options should be displayed inline
    • dateInput("fieldID", "fieldLabel") = creates a selectable date field (dropdown calendar/date picker automatically generated)
      • "fieldID" = field identifier
      • "fieldLabel" = text/name displayed above fields
      • more arguments can be found in ?dateInput
    • submitButton("Submit") = creates a submit button that updates the output/calculations only when the user submits the new inputs (default behavior = all changes update reactively/in real time)
    • actionButton(inputId = "goButton", label = "test") = creates a button with the specified label and id
      • output can be specified for when the button is clicked
    • sliderInput("id", "label", value = 70, min = 62, max = 74, 0.05) = creates a slider for input
      • arguments similar to numericInput and more information can be found ?sliderInput
  • outputs
    • Note: every variable called here must have a corresponding render method in the output element of server.R to render its value
    • textOutput("fieldId", inline = FALSE) = prints the value of the variable/field in text format
      • inline = TRUE = inserts the result inline with the HTML element
      • inline = FALSE = inserts the result in block code format
    • verbatimTextOutput("fieldId") = prints out the value of the specified field defined in server.R
    • plotOutput('fieldId') = plots the output (‘sampleHist’ for example) created from server.R script
    • output$test <- renderText({input$goButton; isolate(paste(input$t1, input$t2))}) = isolate action executes when the button is pressed
      • if (input$goButton == 1){ Conditional statements } = create different behavior depending on the number of times the button is pressed
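For example, an actionButton wired through isolate so the output refreshes only on click (a sketch of the server.R side; t1 and t2 are hypothetical textInput fields):

```r
# server.R fragment (sketch): t1/t2 are hypothetical textInput field ids
output$test <- renderText({
    input$goButton                        # take a reactive dependency on the button
    isolate(paste(input$t1, input$t2))    # read the inputs without a dependency
})
```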

ui.R Example

  • below is part of the ui.R code for a project on Shiny
# load shiny package
library(shiny)
# begin shiny UI
shinyUI(navbarPage("Shiny Project",
    # create first tab
    tabPanel("Documentation",
        # load MathJax library so LaTeX can be used for math equations
        withMathJax(), h3("Why is the Variance Estimator \\(S^2\\) divided by \\(n-1?\\)"),
        # paragraph and bold text
        p("The ", strong("sample variance")," can be calculated in ", strong(em("two")),
          " different ways:",
          "$$S^2 \\mbox{(unbiased)} = \\frac{\\sum_{i=1}^n (X_i - \\bar X)^2}{n-1}
          ~~~\\mbox{and}~~S^2\\mbox{(biased)}=\\frac{\\sum_{i=1}^n (X_i-\\bar X)^2}{n}$$",
          "The unbiased calculation is most often used, as it provides a ",
          strong(em("more accurate")), " estimate of population variance"),
        # break used to space sections
        br(), p("To show this empirically, we simulated the following in the ",
                strong("Simulation Experiment"), " tab: "), br(),
        # ordered list
        tags$ol(
            tags$li("Create population by drawing observations from values 1 to 20."),
            tags$li("Draw a number of samples of specified size from the population"),
            tags$li("Plot difference between sample and true population variance"),
            tags$li("Show the effects of sample size vs accuracy of variance estimated")
        )),
    # second tab
    tabPanel("Simulation Experiment",
        # fluid row for space holders
        fluidRow(
            # fluid columns
            column(4, div(style = "height: 150px")),
            column(4, div(style = "height: 150px")),
            column(4, div(style = "height: 150px"))),
        # main content
        fluidRow(
            column(12,h4("We start by generating a population of ",
                         span(textOutput("population", inline = TRUE),
                         style = "color: red; font-size: 20px"),
                         " observations from values 1 to 20:"),
                   tags$hr(),htmlOutput("popHist"),
                   # additional style
                   style = "padding-left: 20px"
            )
        ),
        # absolute panel
        absolutePanel(
            # position attributes
            top = 50, left = 0, right =0,
            fixed = TRUE,
            # panel with predefined background
            wellPanel(
                fluidRow(
                    # sliders
                    column(4, sliderInput("population", "Size of Population:",
                                          min = 100, max = 500, value = 250),
                           p(strong("Population Variance: "),
                           textOutput("popVar", inline = TRUE))),
                    column(4, sliderInput("numSample", "Number of Samples:",
                                          min = 100, max = 500, value = 300),
                           p(strong("Sample Variance (biased): "),
                           textOutput("biaVar", inline = TRUE))),
                    column(4, sliderInput("sampleSize", "Size of Samples:",
                                          min = 2, max = 15, value = 10),
                           p(strong("Sample Variance (unbiased): "),
                           textOutput("unbiaVar", inline = TRUE)))),
                style = "opacity: 0.92; z-index: 100;"
            ))
    )
))

server.R

  • preamble/code to set up environment (executed only once)
    • start with library() calls to load packages/data
    • define/initiate variables and relevant default values
      • <<- operator should be used to assign values to variables in the parent environment
      • x <<- x + 1 will define x to be the sum of 1 and the value of x (defined in the parent environment/working environment)
    • any other code that you would like to only run once
  • shinyServer() = initiates the server function
    • function(input, output){} = defines a function that performs actions on the inputs user makes and produces an output object
    • non-reactive statements/code will be executed once for each page refresh/submit
    • reactive functions/code are run repeatedly as values are updated (i.e. render)
      • Note: Shiny only runs what is needed for reactive statements, in other words, the rest of the code is left alone
      • reactive(function) = can be used to wrap functions/expressions to create reactive expressions
        • renderText({x()}) = returns value of x, “()” must be included (syntax)
  • reactive function example

    # start shinyServer
    shinyServer(
    # specify input/output function
    function(input, output) {
        # set x as a reactive function that adds 100 to input1
        x <- reactive({as.numeric(input$text1)+100})
        # set value of x to output object text1
        output$text1 <- renderText({x()})
        # set value of x plus value of input object text2 to output object text2
        output$text2 <- renderText({x() + as.numeric(input$text2)})
    }
    )
  • functions/output objects in shinyServer()
    • output$oid1 <- renderPrint({input$id1}) = takes the user's input from field id1 and stores the rendered, printed text in the oid1 variable of the output object
      • renderPrint({expression}) = reactive function to render the specified expression
      • {} is used to ensure the value is an expression
      • oid1 = variable in the output object that stores the result from the subsequent command
    • output$sampleHist <- renderPlot({code}) = stores plot generated by code into sampleHist variable
      • renderPlot({code}) = renders a plot generated by the enclosed R code
    • output$sampleGVisPlot <- renderGvis({code}) = renders Google Visualization object

server.R Example

  • below is part of the server.R code for a project on Shiny that uses googleVis
# load libraries
library(shiny)
require(googleVis)
# begin shiny server
shinyServer(function(input, output) {
    # define reactive parameters
    pop<- reactive({sample(1:20, input$population, replace = TRUE)})
    bootstrapSample<-reactive({sample(pop(),input$sampleSize*input$numSample,
        replace = TRUE)})
    popVar<- reactive({round(var(pop()),2)})
    # print text through reactive function
    output$biaVar <- renderText({
        sample<- as.data.frame(matrix(bootstrapSample(), nrow = input$numSample,
            ncol =input$sampleSize))
        return(round(mean(rowSums((sample-rowMeans(sample))^2)/input$sampleSize), 2))
    })
    # google visualization histogram
    output$popHist <- renderGvis({
        popHist <- gvisHistogram(data.frame(pop()), options = list(
            height = "300px",
            legend = "{position: 'none'}", title = "Population Distribution",
            subtitle = "samples randomly drawn (with replacement) from values 1 to 20",
            histogram = "{ hideBucketItems: true, bucketSize: 2 }",
            hAxis = "{ title: 'Values', maxAlternation: 1, showTextEvery: 1}",
            vAxis = "{ title: 'Frequency'}"
        ))
        return(popHist)
    })
})

Distributing Shiny Application

  • running code locally = running local server and browser routing through local host
  • quickest way = send application directory
  • possible to create R package and create a wrapper that calls runApp (requires R knowledge)
  • another option = run shiny server (link)
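The R-package option can be sketched as a wrapper function; the package and app directory names (mypackage, inst/myapp) are illustrative:

```r
# exported wrapper: launches the Shiny app bundled in the package's inst/myapp folder
launchApp <- function() {
    appDir <- system.file("myapp", package = "mypackage")
    shiny::runApp(appDir)
}
```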

Debugging

  • runApp(display.mode = 'showcase') = highlights execution while running a shiny application
  • cat = can be used to display output to stdout/R console
  • browser() = interrupts execution (tutorial)
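A sketch of how cat and browser() might be dropped into a reactive expression during development (the input name text1 is illustrative):

```r
x <- reactive({
    # cat writes to the R console that is running the app
    cat("recomputing x; input$text1 =", input$text1, "\n")
    # browser()  # uncomment to pause here and inspect the environment
    as.numeric(input$text1) + 100
})
```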

manipulate Package

  • manipulate = package/function that can be leveraged to create quick interactive graphics by letting the user vary the variables of a model/calculation
  • creates sliders/checkbox/picker for the user (documentation)

Example

# load data and manipulate package
library(UsingR)
library(manipulate)
# plotting function
myHist <- function(mu){
    # histogram
    hist(galton$child,col="blue",breaks=100)
    # vertical line to highlight the mean
    lines(c(mu, mu), c(0, 150),col="red",lwd=5)
    # calculate mean squared error
    mse <- mean((galton$child - mu)^2)
    # updates the mean value as the mean is changed by the user
    text(63, 150, paste("mu = ", mu))
    # updates the mean squared error value as the mean is changed by the user
    text(63, 140, paste("MSE = ", round(mse, 2)))
}
# creates a slider to vary the mean for the histogram
manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))

rCharts

  • rCharts = a simple way of creating interactive JavaScript visualizations from R
    • more in-depth knowledge of D3 will be needed to create more complex tools
    • written by Ramnath Vaidyanathan
    • uses formula interface to specify plots (like the lattice plotting system)
    • displays interactive tool tips when hovering over data points on the plots
  • installation
    • devtools must be installed first (install.packages("devtools"))
    • require(devtools); install_github('rCharts', 'ramnathv') installs the rCharts package from GitHub
  • plot types
    • Note: each of the following JS libraries has different default styles as well as a multitude of capabilities; the following list covers only what was demonstrated, and more documentation can be found in the corresponding links
    • [Polychart js library] rPlot = paneled scatter plots
    • [Morris js library] mPlot = time series plot (similar to stock price charts)
    • [NVD3 js library] nPlot = stacked/grouped bar charts
    • [xCharts js library] = shaded line graphs
    • [HighChart js library] = stacked (different styles) scatter/line charts
    • [LeafLet js library] = interactive maps
    • [Rickshaw js library] = stacked area plots/time series
  • rCharts objects have various attributes/functions that you can use once the object is saved (e.g. n1 <- nPlot(...))
    • n1$ + TAB in R Console brings up list of all functions contained in the object
    • n1$html() = prints out the HTML for the plot
    • n1$save("filename.html") = saves result to a file named “filename.html”
    • n1$print() = print out the JavaScript
    • n1$show("inline", include_assets = TRUE, cdn = F) = embeds the HTML/JS code directly within an Rmd file (for HTML output)
    • n1$publish('plotname', host = 'gist'/'rpubs') = publishes the plot under the specified plotname as a gist or to rpubs
  • to use with slidify, the following YAML (YAML Ain't Markup Language) must be added
    • ext_widgets : {rCharts: ["libraries/nvd3"]}
  • cat('<iframe src="map3.html" width=100%, height=600></iframe>') can be used to embed a map or chart from a saved file (saved with map3$save('map3.html', cdn = TRUE))

Example

# load rCharts package
require(rCharts); library(datasets); library(knitr)
# create dataframe with HairEyeColor data
haireye = as.data.frame(HairEyeColor)
# create a nPlot object
n1 <- nPlot(Freq ~ Hair, group = 'Eye', type = 'multiBarChart',
            data = subset(haireye, Sex == 'Male'))
# embed the nPlot object inline in the document
n1$show("inline", include_assets = TRUE, cdn = F)

ggvis package

  • ggvis is a data visualization package for R that lets you:
    • declaratively describe data graphics with a syntax similar in spirit to ggplot2
    • create rich interactive graphics that you can play with locally in Rstudio or in your browser
    • leverage shiny’s infrastructure to publish interactive graphics usable from any browser (either within your company or to the world).
  • the goal is to combine the best of R and the best of the web
    • data manipulation and transformation are done in R
    • graphics are rendered in a web browser, using Vega
    • for RStudio users, ggvis graphics display in a viewer panel, which is possible because RStudio is a web browser
  • can use the pipe operator, %>%, to chain graphing functions
    • set_options(renderer = "canvas") = can be used to control what renderer the graphics is produced with
    • example: mtcars %>% ggvis(~mpg, ~wt, fill = ~ as.factor(am)) %>% layer_points() %>% layer_smooths()
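The pipe example above, expanded into a block with set_options chained in (per the renderer bullet; the mtcars dataset ships with R):

```r
# load ggvis (assumes the package is installed)
library(ggvis)
mtcars %>%
    ggvis(~mpg, ~wt, fill = ~as.factor(am)) %>%  # formulas map data to aesthetics
    layer_points() %>%                           # scatter layer
    layer_smooths() %>%                          # smoothed trend layer
    set_options(renderer = "canvas")             # render with canvas instead of SVG
```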


GoogleVis API

  • the googleVis package allows R to create interactive HTML graphics (Google Charts)
  • chart types (format = gvis+ChartType)
    • Motion charts: gvisMotionChart
    • Interactive maps: gvisGeoChart
    • Interactive tables: gvisTable
    • Line charts: gvisLineChart
    • Bar charts: gvisColumnChart
    • Tree maps: gvisTreeMap
    • more charts can be found here
  • configuration options and default values for arguments for each of the plot types can be found here
  • print(chart, "chart") = prints the JavaScript for creating the interactive plot so it can be embedded in slidify/HTML document
    • print(chart) = prints HTML + JavaScript directly
  • alternatively, to print the charts on a HTML page, you can use op <- options(gvis.plot.tag='chart')
  • this sets the googleVis options first to change the behaviour of plot.gvis, so that only the chart component of the HTML file is written into the output file
  • plot(chart) can then be called to print the plots to HTML
  • gvisMerge(chart1, chart2, horizontal = TRUE, tableOptions = "bgcolor=\"#CCCCCC\" cellspacing=10") = combines the two plots into one horizontally (1 x 2 panel)
    • Note: gvisMerge() can only combine TWO plots at a time
    • horizontal = FALSE = combines plots vertically (TRUE for horizontal combination)
    • tableOptions = ... = used to specify attributes of the combined plot
  • demo(googleVis) = demos how each of the plot works
  • resources
    • vignette
    • documentation
    • plot gallery
    • FAQ

Example (line chart)

# load googleVis package
library(googleVis)
# set gvis.plot options to only return the chart
op <- options(gvis.plot.tag='chart')
# create initial data with x variable as "label" and y variable as "var1/var2"
df <- data.frame(label=c("US", "GB", "BR"), val1=c(1,3,4), val2=c(23,12,32))
# set up a gvisLineChart with x and y
Line <- gvisLineChart(df, xvar="label", yvar=c("val1","val2"),
        # set options for the graph (list) - title and location of legend
        options=list(title="Hello World", legend="bottom",
                # set title text style
                titleTextStyle="{color:'red', fontSize:18}",
                # set vertical gridlines
                vAxis="{gridlines:{color:'red', count:3}}",
                # set horizontal axis title and style
                hAxis="{title:'My Label', titleTextStyle:{color:'blue'}}",
                # set plotting style of the data
                series="[{color:'green', targetAxisIndex: 0},
                         {color: 'blue',targetAxisIndex:1}]",
                # set vertical axis labels and formats
                vAxes="[{title:'Value 1 (%)', format:'##,######%'},
                                  {title:'Value 2 (\U00A3)'}]",
                # set line plot to be smoothed and set width and height of the plot
                curveType="function", width=500, height=300
                ))
# print the chart in JavaScript
plot(Line)

Example (merging graphs)

# geo chart of profits by country (Exports and Fruits datasets ship with googleVis)
G <- gvisGeoChart(Exports, "Country", "Profit", options = list(width = 200, height = 100))
# table of the same data
T1 <- gvisTable(Exports, options = list(width = 200, height = 270))
# motion chart of the Fruits dataset
M <- gvisMotionChart(Fruits, "Fruit", "Year", options = list(width = 400, height = 370))
# stack the geo chart on top of the table
GT <- gvisMerge(G, T1, horizontal = FALSE)
# place the stacked pair next to the motion chart
GTM <- gvisMerge(GT, M, horizontal = TRUE, tableOptions = "bgcolor=\"#CCCCCC\" cellspacing=10")
plot(GTM)
  • Note: the motion chart only displays when it is hosted on a server or allowed as a trusted location in the Macromedia/Flash security settings; see the googleVis vignette for more details

ShinyApps.io (link)

  • platform created by RStudio to share Shiny apps on the web
  • log in through GitHub/Google, and set up access in R
    1. Make sure you have devtools installed in R (install.packages("devtools"))
    2. enter devtools::install_github('rstudio/shinyapps'), which installs the shinyapps package from GitHub
    3. follow the instructions to authenticate your shiny apps account in R through the generated token
    4. publish your app through deployApp() command
  • the apps you deploy will be hosted on ShinyApps.io under your account
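A sketch of steps 2–4 in R; the account name, token, secret, and app directory are placeholders:

```r
# install the deployment package from GitHub (step 2)
devtools::install_github('rstudio/shinyapps')
library(shinyapps)
# authenticate with the token generated on ShinyApps.io (step 3)
shinyapps::setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")
# deploy the app directory containing ui.R and server.R (step 4)
deployApp("~/myShinyApp")
```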

plot.ly (link)

  • platform to share and edit plots modularly on the web (examples)
    • every part of the plot can be customized and modified
    • graphs can be converted from one language to another
  • you can choose to log in through Facebook/Twitter/Google/GitHub
    1. make sure you have devtools installed in R
    2. enter devtools::install_github("ropensci/plotly"), which installs plotly package from GitHub
    3. go to https://plot.ly/r/getting-started/ and follow the instructions
    4. enter library(plotly); set_credentials_file("<username>", "<token>") with the appropriate username and token filled in
    5. use plotly() methods to upload plots to your account
    6. modify any part of the plot as you like once uploaded
    7. share the plot

Example

# load packages
library(plotly); library(ggplot2)
# make sure your plot.ly credentials are set correctly using the following command
#   set_credentials_file(username=<FILL IN>, api_key=<FILL IN>)

# load data
load("courseraData.rda")
# bar plot using ggplot2
g <- ggplot(myData, aes(y = enrollment, x = class, fill = as.factor(offering)))
g <- g + geom_bar(stat = "identity")
g

# initiate plotly object
py <- plotly()
# interface with plot.ly and ggplot2 to upload the plot to plot.ly under your credentials
out <- py$ggplotly(g)
# typing this in R console will return the url of the generated plot
out$response$url
## [1] "https://plot.ly/~sxing/75"
  • the above is the response URL for the plot created on plot.ly
  • Note: this particular URL corresponds to the plot on my personal account

Structure of a Data Analysis Report

  • can be blog post or formal report for analysis of given dataset
  • components
    • prompt file = analysis being performed (this file is not mandatory/available in all cases)
    • data folder = data to be analyzed + report files
      • Note: always keep track of data provenance (for reproducibility)
    • code folder = rawcode vs finalcode
      • rawcode = analysis as performed (.Rmd file)
        • all exploratory data analysis
        • not for sharing with others
      • finalcode = only the analysis to be shared with other people/summarizations (.Rmd file)
        • not necessarily final but pertinent to the analysis discussed in the final report
    • figures folder = final plots that have all appropriate formatting for distribution/presentation
    • writing folder = final analysis report and final figure captions
      • report should have the following sections: title, introduction, methods, results, conclusions, references
      • final figure captions should correspond with the figures created
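The folder skeleton above can be created in one step; the top-level name analysis and the .Rmd file names are illustrative:

```shell
# create the report skeleton described above
mkdir -p analysis/data analysis/code analysis/figures analysis/writing
# placeholder code files (raw exploratory vs final shared analysis)
touch analysis/code/rawcode.Rmd analysis/code/finalcode.Rmd
```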

Slidify

  • tool for creating data-centric presentations, written by Ramnath Vaidyanathan
  • amalgamation of knitr, Markdown, JavaScript libraries for HTML5 presentations
  • easily extendable/customizable
  • allows embedded code chunks and mathematical formulas (MathJax JS library) to be rendered correctly
  • final products are HTML files, which can be viewed with any web browser and shared easily

  • installation
    1. make sure you have devtools package installed in R
    2. enter install_github('slidify', 'ramnathv'); install_github('slidifyLibraries', 'ramnathv') to install the slidify packages
    3. load slidify package with library(slidify)
    4. set the working directory to the project you are working on with setwd("~/project")
  • author("title") = sets up initial files for a new slidify project (performs the following things)
    1. title (or any name you typed) directory is created inside the current working directory
    2. assets subdirectory and a file named index.Rmd are created inside title directory
    3. assets subdirectory is populated with the following empty folders:
      • css
      • img
      • js
      • layouts
      • Note: any custom CSS/images/JavaScript you want to use should be put into the above folders correspondingly
    4. index.Rmd R Markdown file will open up in RStudio
  • slidify("index.Rmd") = processes the R Markdown file into a HTML page and imports all necessary libraries
  • library(knitr); browseURL("index.html") = opens up the built-in web browser in R Studio and displays the slidify presentation
    • Note: this is only necessary the first time; you can refresh the page to reflect any changes after saving the HTML file
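The install/author/build steps above, collected into one script; the deck name mydeck and the project path are illustrative:

```r
# one-time installation from GitHub
library(devtools)
install_github('slidify', 'ramnathv')
install_github('slidifyLibraries', 'ramnathv')
# build a new deck
library(slidify)
setwd("~/project")        # project directory
author("mydeck")          # scaffolds mydeck/ with assets/ and index.Rmd
slidify("index.Rmd")      # process index.Rmd into index.html
browseURL("index.html")   # preview the result
```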

YAML (YAML Ain’t Markup Language/Yet Another Markup Language)

  • used to specify options for the R Markdown/slidify at the beginning of the file
  • format: field : value # comment
    • title = title of document
    • subtitle = subtitle of document
    • author = author of document
    • job = occupation of author (can be left blank)
    • framework = controls formatting, usually the name of a library (e.g. io2012)
      • io2012
      • html5slides
      • deck.js
      • dzslides
      • landslide
      • Slidy
    • highlighter = controls syntax highlighting for the presentation (e.g. highlight.js)
    • hitheme = specifies the theme of the highlighted code (e.g. tomorrow)
    • widgets = loads additional libraries to display LaTeX math equations (mathjax), quiz-style components (quiz), and additional styles (bootstrap = Twitter-created style)
      • for math expressions, the code should be enclosed in $expression$ for inline expressions, and $$expression$$ for block equations
    • mode = selfcontained/standalone/draft = chosen depending on whether the presentation will be given with Internet access or not
      • standalone = all the JavaScript libraries are saved locally so that the presentation can run without Internet access
      • selfcontained = all JavaScript libraries are loaded at the time of presentation
    • logo = displays a logo in title slide
    • url = specify path to assets/other folders that are used in the presentation
      • Note: ../ signifies the parent directory
  • example
---
title       : Slidify
subtitle    : Data meets presentation
author      : Jeffrey Leek, Assistant Professor of Biostatistics
job         : Johns Hopkins Bloomberg School of Public Health
logo        : bloomberg_shield.png
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      #
url:
  lib: ../../libraries
  assets: ../../assets
widgets     : [mathjax]            # {mathjax, quiz, bootstrap}
mode        : selfcontained # {standalone, draft}
---

Slides

  • ## = signifies the title of the slide \(\rightarrow\) equivalent of h1 element in HTML
  • --- = marks the end of a slide
  • .class #id = assigns class and id attributes (CSS) to the slide and can be used to customize the style of the page
  • Note: make sure to leave space between each component of the slidify document (title, code, text, etc) to avoid errors
  • advanced HTML can be added directly to the index.Rmd file and most of the time it should function correctly
  • interactive element (quiz questions, rCharts, shiny apps) can be embedded into slidify documents (demos)
    • quiz elements
      • --- &radio before slide content for multiple choice (make sure quiz is included in widgets)
      • ## = signifies title of questions
      • the question can be typed in plain text format
      • the multiple choice options are listed by number (1. a, 2. b, etc.)
        • wrap the correct answer in underscores (2. _b_)
      • *** .hint = denotes the hint that will be displayed when the user clicks on Show Hint button
      • *** .explanation = denotes the explanation that will be displayed when the user clicks on Show Answer button
      • a page like the one below will be generated when processed with slidify

--- &radio

## Question 1

What is 1 + 1?

1. 1
2. _2_
3. 3
4. 4

*** .hint
This is a hint

*** .explanation
This is an explanation

  • knit HTML button can be used to generate previews for the presentation as well

Publishing

  • first, you will need to create a new repository on GitHub
  • publish_github("user", "repo") can be used to publish the slidify document onto your online repo

RStudio Presentation

  • presentation authoring tool within the RStudio IDE (tutorial)
  • output = html5 presentation
  • .Rpres file \(\rightarrow\) converted to .md file \(\rightarrow\) .html file
  • uses the same R Markdown format as slidify/knitr
    • mathjax JS library is loaded by default
  • RStudio formats/runs the code when the document is saved

Creating Presentation

  • file \(\rightarrow\) New File \(\rightarrow\) R Presentation (alt-f + f + p)
  • class: classname = specify slide-specific control from CSS
  • css: file.css = can be used to import an external CSS file
    • alternatively, a css file that has the same name as the presentation will be automatically loaded
  • knowledge of CSS/HTML/JavaScript useful to customize presentation more granularly
    • Note: though the end HTML file can be edited directly, it should be used as a last resort as it defeats the purpose of reproducible presentations
  • clicking on Preview button brings up Presentation viewer in RStudio
    • navigation controls (left and right arrows) are located in right bottom corner
    • the Notepad icon on the menu bar above displays the section of code that corresponds with the current slide in the main window
    • the More button has four options
      • “Clear Knitr Cache” = clears cache for the generated presentation previews
      • “View in Browser” = creates temporary HTML file and opens in default web browser (does not create a local file)
      • “Save as Web Page” = creates a copy of the presentation as a web page
      • “Publish to RPubs” = publishes presentation on RPubs
    • the Refresh button refreshes the page
    • the Zoom button opens a new window to display the presentation
  • transitions between slides
    • just after the beginning of each slide, the transition property (similar to YAML) can be specified to control the transition between the previous and current slides
    • transition: linear = creates 2D linear transition (html5) between slides
    • transition: rotate = creates 3D rotating transition (html5) between slides
    • more transition options are found here
  • hierarchical organization
    • attribute type can be added to specify the appearance of the slide (“slide type”)
    • type: section and type: sub-section = distinct background and font colors, slightly larger heading text, appear at a different indent level within the slide navigation menu
    • type: prompt and type: alert = distinct background color to communicate to viewers that the slide has different intent
  • columns
    • simply place *** in between two sections of content on a slide to separate it into two columns
    • left: 70% can be used to specify the proportions of each column
    • right: 30% works similarly
  • change slide font (guide)
    • font-family: fontname = changes the font of slide (specified in the same way as HTML)
    • font-import: http://fonts.googleapis.com/css?family=Risque = imports font
      • Note: fonts must be present on the system for presentation (or have Internet), or default fonts will be used
    • Note: CSS selectors for class and IDs must be preceded by .reveal to work (.reveal section del applies to any text enclosed by ~~text~~)
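Putting the slide attributes above together, a minimal .Rpres source might look like this (titles and content are illustrative):

```
My Presentation
========================================================
author: Jane Doe
date: 2018-08-07

Section Slide
========================================================
type: section
transition: rotate

Two Columns
========================================================
left: 70%

Left column content (70% of the width)

***

Right column content (30% of the width)
```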

Slidify vs RStudio Presenter

  • Slidify
    • flexible control from the .Rmd file
    • under constant development
    • large user base, more likely to get answer on StackOverflow
    • lots of styles and options by default
    • steeper learning curve
    • more command-line oriented
  • RStudio Presenter
    • embedded in R Studio
    • more GUI oriented
    • very easy to get started
    • smaller set of easy styles and options
    • default styles look nice
    • as flexible as Slidify with CSS/HTML knowledge

R Package

  • R packages = collection of functions/data objects, extends base functionality of R
    • organized to provide consistency
    • written by people all over the world
  • primarily available from CRAN and Bioconductor
    • installed with install.packages()
  • also available on GitHub/BitBucket/etc
    • installed using devtools::install_github()
  • documentation/vignettes = forces author to provide detailed explanation for the arguments/results of their functions and objects
    • allows for well defined Application Programming Interface (API) to tell users what and how to use the functions
    • much of the implementation details can be hidden from the user so that updates can be made without interfering with use cases
  • if a package is available on CRAN, it must meet standards of reliability and robustness
  • fairly easily maintained with proper documentation

  • package development process
    • write code in R script
    • desire to make code available to others
    • incorporate R script file into R package structure
    • write documentation for user functions
    • include examples/demos/datasets/tutorials
    • package the contents
    • submit to CRAN/Bioconductor
    • push source code repository to GitHub/other code sharing site
      • people find problems with code and expect the author to fix it
      • alternatively, people might fix the problem and show the author the changes
    • incorporate changes and release a new version

R Package Components

  • directory with name of R package = created as first step
  • DESCRIPTION file = metadata/information about package
  • R code = code should be in R/ sub-directory
  • Documentation = file should be in man/ sub-directory
  • NAMESPACE = optional but common and best practice
  • full requirements documented in Writing R Extensions

DESCRIPTION file

  • Package = name of package (e.g. library(name) to load the package)
  • Title = full name of package
  • Description = longer description of the package in one or two sentences
  • Version = version number (usually M.m-p format, “majorNumber.minorNumber-patchLevel”)
  • Author = Name of the original author(s)
  • Maintainer = name + email of person (maintainer) who fixes problems
  • License = license for the source code; describes the terms that the source code is released under
    • common licenses include GNU/BSD/MIT
    • typically a standard open source license is used
  • optional fields
    • Depends = R packages that your package depends on
    • Suggests = optional R packages that users may want to have installed
    • Date = release date in YYYY-MM-DD format
    • URL = package home page/link to repository
    • Other = fields can be added (generally ignored by R)
  • example: gpclib
    • Package: gpclib
    • Title: General Polygon Clipping Library for R
    • Description: General polygon clipping routines for R based on Alan Murta’s C library
    • Version: 1.5-5
    • Author: Roger D. Peng rpeng@jhsph.edu with contributions from Duncan Murdoch and Barry Rowlingson; GPC library by Alan Murta
    • Maintainer: Roger D. Peng rpeng@jhsph.edu
    • License: file LICENSE
    • Depends: R (>= 2.14.0), methods
    • Imports: graphics
    • Date: 2013-04-01
    • URL: http://www.cs.man.ac.uk/~toby/gpc/, http://github.com/rdpeng/gpclib

R Code

  • copy R code to R/ directory
  • can be any number of files
  • separate out files to logical groups (read/fit models)
  • all code should be included here and not anywhere else in the package

NAMESPACE file

  • effectively an API for the package
  • indicates which functions are exported \(\rightarrow\) public functions that users have access to and can use
    • functions not exported cannot be called directly by the user
    • hides the implementation details from the users (clean package interface)
  • lists all dependencies on other packages/indicate what functions you imported from other packages
    • allows for your package to use other packages without making them visible to the user
    • importing a function loads the package but does not attach it to the search list
  • key directives
    • export("<function>") = export a function
    • import("<package>") = import a package
    • importFrom("<package>", "<function>") = import specific function from a package
    • exportClasses("<class>") = indicate the new types of S4 (4th version of S) classes created with the package (objects of the specified class can be created)
    • exportMethods("<generic>") = methods that can operate on the new class objects
    • Note: though they look like R functions, the above directives are not functions that users can use freely
  • example
# read.polyfile/write.polyfile are functions available to user
export("read.polyfile", "write.polyfile")
# import plot function from graphics package
importFrom(graphics, plot)
# gpc.poly/gpc.poly.nohole classes can be created by the user
exportClasses("gpc.poly", "gpc.poly.nohole")
# the listed methods can be applied to the gpc.poly/gpc.poly.nohole classes
exportMethods("show", "get.bbox", "plot", "intersect", "union", "setdiff",
              "[", "append.poly", "scale.poly", "area.poly", "get.pts",
              "coerce", "tristrip", "triangulate")

Documentation

  • documentation files (.Rd) should be placed in the man/ sub-directory
  • written in specific markup language
  • required for every exported function available to the user (serves to limit the number of exported functions)
  • concepts/package/datasets overview can also be documented

  • components
    • \name{} = name of function
    • \alias{} = anything listed as alias will bring up the help file (?line is the same as ?residuals.tukeyline)
      • multiple aliases possible
    • \title{} = full title of the function
    • \description{} = full description of the purpose of function
    • \usage{} = format/syntax of function
    • \arguments{} = explanation of the arguments in the syntax of function
    • \details{} = notes/details about limitation/features of the function
    • \value{} = specifies what object is returned
    • \references{} = references for the function (the paper/book the method comes from)
  • example: line function

\name{line}
\alias{line}
\alias{residuals.tukeyline}
\title{Robust Line Fitting}
\description{
  Fit a line robustly as recommended in \emph{Exploratory Data Analysis}.
}
\usage{
line(x, y)
}
\arguments{
  \item{x, y}{the arguments can be any way of specifying x-y pairs.  See
    \code{\link{xy.coords}}.}
}
\details{
  Cases with missing values are omitted.

  Long vectors are not supported.
}
\value{
  An object of class \code{"tukeyline"}.

  Methods are available for the generic functions \code{coef},
  \code{residuals}, \code{fitted}, and \code{print}.
}
\references{
  Tukey, J. W. (1977).
  \emph{Exploratory Data Analysis},
  Reading Massachusetts: Addison-Wesley.
}

Building/Checking Package

  • R CMD (Command) Build = command-line program that creates a package archive file (format = .tar.gz)
  • R CMD Check = command-line program that runs a battery of tests on the package to ensure structure is consistent/all components are present/export and import are appropriately specified
  • R CMD Build/R CMD check can be run from the command-line using terminal/command-shell applications
  • alternatively, they can be run from R using the system() function
    • system("R CMD build newpackage")
    • system("R CMD check newpackage")
  • the package must pass all tests to be put on CRAN
    • documentation exists for all exported functions
    • code can be loaded without any major coding problems/errors
    • contains license
    • ensure examples in documentation can be executed
    • check that the documentation of arguments matches the arguments in the code
  • package.skeleton() function in the utils package = creates a “skeleton” R package
    • automatically creates directory structure (R/, man/), DESCRIPTION file, NAMESPACE file, documentation files
    • if there are any visible/stored functions in the current workspace, their code will be written as R code files in the R/ directory
    • documentation stubs are created in man/ directory
    • the rest of the content can then be modified and added
  • alternatively, you can click on the menu on the top right hand corner of RStudio: Project \(\rightarrow\) New Project \(\rightarrow\) New Directory \(\rightarrow\) R Package \(\rightarrow\) fill in package names and details \(\rightarrow\) automatically generate structure/skeleton of a new R package
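The skeleton-then-build workflow above can be sketched from an R session; the package and function names here are invented for illustration:

```r
# Generate a skeleton package from a function in the current workspace,
# then build and check it (requires R available on the system PATH).
topten <- function(x, y) NULL   # placeholder function for illustration
package.skeleton(name = "newpackage", list = "topten")
# newpackage/ now contains a DESCRIPTION file, NAMESPACE, and R/ and man/ stubs
system("R CMD build newpackage")   # creates newpackage_<version>.tar.gz
system("R CMD check newpackage")   # runs the battery of tests described above
```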

Checklist for Creating Package

  • create a new directory with R/ and man/ sub-directories (or just use package.skeleton())
  • write a DESCRIPTION file
  • copy R code into the R/ sub-directory
  • write documentation files in man/ sub-directory
  • write a NAMESPACE file with exports/imports
  • build and check

Example: topten function

  • when creating a package, generate skeleton by clicking on Project \(\rightarrow\) New Project \(\rightarrow\) New Directory \(\rightarrow\) R Package \(\rightarrow\) fill in package names and details
  • write the code first in a .R script and add documentation directly to the script
    • Roxygen2 package will be leveraged to extract and format the documentation from R script automatically
  • Roxygen2 syntax
    • #' = denotes the beginning of documentation
      • RStudio automatically continues the #' prefix on subsequent lines as you type
    • title should be on the first line (relatively concise, a few words)
    • description/summary should begin on the third line (one or two sentences)
      • after each of these sections, press ENTER and RStudio will insert an empty line and move the cursor to the next section
    • @param x definition = format of the documentation for the arguments
      • x = argument name (formatted in code format when processed to differentiate from definition)
      • definition = explanation of what x represents
    • @author = author of the function
    • @details = detailed description of the function and its purpose
    • @seealso = links to relevant functions used in creating the current function that may be of interest to the user
    • @import package = imports all functions from the specified package (use @importFrom package function for a single function)
    • @export = denotes that this function is exported for public use
    • @return = specifies what is returned by the method
#' Building a Model with Top Ten Features
#'
#' This function develops a prediction algorithm based on the top 10 features
#' in 'x' that are most predictive of 'y'.
#'
#' @param x a n x p matrix of n observations and p predictors
#' @param y a vector of length n representing the response
#' @return a 'lm' object representing the linear model with the top 10 predictors
#' @author Roger Peng
#' @details
#' This function runs a univariate regression of y on each predictor in x and
#' calculates the p-value indicating the significance of the association. The
#' final set of 10 predictors is taken from the 10 smallest p-values.
#' @seealso \code{lm}
#' @import stats
#' @export

topten <- function(x, y) {
        p <- ncol(x)
        if(p < 10)
                stop("there are less than 10 predictors")
        pvalues <- numeric(p)
        for(i in seq_len(p)) {
                fit <- lm(y ~ x[, i])
                summ <- summary(fit)
                pvalues[i] <- summ$coefficients[2, 4]
        }
        ord <- order(pvalues)
        # keep only the 10 predictors with the smallest p-values
        x10 <- x[, ord[1:10]]
        fit <- lm(y ~ x10)
        coef(fit)
}

#' Prediction with Top Ten Features
#'
#' This function takes a set of coefficients produced by the \code{topten}
#' function and makes a prediction for each of the values provided in the
#' input 'X' matrix.
#'
#' @param X a n x 10 matrix containing n observations
#' @param b a vector of coefficients obtained from the \code{topten} function
#' @return a numeric vector containing the predicted values

predict10 <- function(X, b) {
        X <- cbind(1, X)
        drop(X %*% b)
}
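A minimal usage sketch (the coefficient vector below is invented for illustration; predict10() is repeated so the snippet is self-contained):

```r
predict10 <- function(X, b) {   # as defined above
        X <- cbind(1, X)        # prepend a column of 1s for the intercept
        drop(X %*% b)
}

set.seed(42)
X <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)  # 5 observations, 10 features
b <- c(1, rep(0.5, 10))                          # intercept + 10 coefficients
preds <- predict10(X, b)
length(preds)   # one prediction per row of X
```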

R Classes and Methods

  • classes and methods let you represent new data types, with their own structure and functions, that R doesn't support yet
  • R classes and methods \(\rightarrow\) system for object oriented programming (OOP)
  • R is interactive and supports OOP, whereas many other common OOP languages (C++, Java, Python, Perl) are generally not interactive
  • John Chambers wrote most of the code supporting classes/methods (documented in Programming with Data)
  • goal is to allow users (who leverage system capabilities) to become programmers (who extend system capabilities)
  • OOP in R is structured differently from most other languages

  • two style of classes and methods
    • S3 (version three of the S language)
      • informal, “old-style”, kludgey (functional but not elegant)
    • S4 (version four of the S language)
      • rigorous standard, more formal, “new-style”
      • code for implementing S4 classes and methods found in methods package
    • two systems living side by side
      • each is independent but S4 are encouraged for new projects

Object Oriented Programming in R

  • class = description of some thing/object (new data type/idea)
    • can be defined using setClass() function in methods package
    • all objects in R have a class, which can be determined through the class() function
      • numeric = number data, can be vectors as well (series of numbers)
      • logical = TRUE, FALSE, NA
        • Note: NA is by default of logical class, however you can have numeric/character NA’s as results of operations
      • character = string of characters
      • lm = linear model class, output from a linear model
  • object = instances of a class
    • can be created using new()
  • method = function that operates on certain class of objects
    • also can be thought of as an implementation of a generic function for an object of particular class
    • can write new method for existing generic functions or create new generic function and associated methods
    • getS3method(<genericFunction>, <class>) = returns code for S3 method for a given class
      • some S3 methods can be called directly (i.e. mean.default)
      • but you should never call them, always use the generic
    • getMethod(<genericFunction>, <signature/class>) = returns code for S4 method for a given class
      • signature = character vector indicating class of objects accepted by the method
      • S4 methods cannot be called directly by the user at all
  • generic function = R function that dispatches methods to perform a certain task (i.e. plot, mean, predict)
    • performs different calculations depending on context
    • Note: generic functions themselves don’t perform any computation; typing the function name by itself (i.e. plot) will return the content of the function
    • S3 and S4 functions look different but are similar conceptually
    • methods("mean") = returns methods associated with S3 generic function
    • showMethods("show") = returns methods associated with S4 generic function
      • Note: show is equivalent of print, but generally not called directly as objects are auto-printed
    • first argument = object of particular class
    • process:
      1. generic function checks class of object
      2. if appropriate method for class exists, call that method on object (process complete)
      3. if no method exists for class, search for default method
      4. if default method exists, default method is called on object (process complete)
      5. if no default method exists, error thrown (process complete)
    • for classes like data.frame where each column can be of different class, the function uses the methods correspondingly
      • plotting as.ts(x) and x produce completely different results
        • as.ts() = converts object to time series
  • Note: ?Classes, ?Methods, ?setClass, ?setMethod, and ?setGeneric contains very helpful documentation

  • example

# S3 method: mean
mean
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x7f9a85b2bff0>
## <environment: namespace:base>
# associated methods
methods("mean")
## [1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt
# code for mean (first 10 lines)
# note: no specific function for the numeric class, so the default is used
head(getS3method("mean", "default"), 10)
##                                                                       
## 1  function (x, trim = 0, na.rm = FALSE, ...)                         
## 2  {                                                                  
## 3      if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {      
## 4          warning("argument is not numeric or logical: returning NA")
## 5          return(NA_real_)                                           
## 6      }                                                              
## 7      if (na.rm)                                                     
## 8          x <- x[!is.na(x)]                                          
## 9      if (!is.numeric(trim) || length(trim) != 1L)                   
## 10         stop("'trim' must be numeric of length one")
# S4 method: show
show
## standardGeneric for "show" defined from package "methods"
## 
## function (object) 
## standardGeneric("show")
## <bytecode: 0x7f9a8397f718>
## <environment: 0x7f9a831dcab0>
## Methods may be defined for arguments: object
## Use  showMethods("show")  for currently available ones.
## (This generic function excludes non-simple inheritance; see ?setIs)
# associated methods
showMethods("show")
## Function: show (package methods)
## object="ANY"
## object="C++Class"
## object="C++Function"
## object="C++Object"
## object="classGeneratorFunction"
## object="classRepresentation"
## object="color"
## object="Enum"
## object="EnumDef"
## object="envRefClass"
## object="function"
##     (inherited from: object="ANY")
## object="genericFunction"
## object="genericFunctionWithTrace"
## object="MethodDefinition"
## object="MethodDefinitionWithTrace"
## object="MethodSelectionReport"
## object="MethodWithNext"
## object="MethodWithNextWithTrace"
## object="Module"
## object="namedList"
## object="ObjectsWithPackage"
## object="oldClass"
## object="refClassRepresentation"
## object="refMethodDef"
## object="refObjectGenerator"
## object="signature"
## object="sourceEnvironment"
## object="SQL"
## object="standardGeneric"
##     (inherited from: object="genericFunction")
## object="SymbolicConstant"
## object="traceable"
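The dispatch steps above can also be sketched with a made-up S3 class (the class name "circle" and its method are hypothetical, for illustration only):

```r
# Steps 1-2 of dispatch: print() sees class "circle" and calls print.circle()
print.circle <- function(x, ...) cat("circle of radius", x$r, "\n")
obj <- structure(list(r = 2), class = "circle")
print(obj)            # dispatches to print.circle: "circle of radius 2"
# Steps 3-4: with the class stripped, dispatch falls through to the list method
print(unclass(obj))
```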

Creating a New Class/Methods

  • reason for creating new classes/data type (not necessarily unknown to the world, but just unknown to R)
    • powerful way to extend the functionality of R
    • represent new types of data (e.g. gene expression, space-time, hierarchical, sparse matrices)
    • new concepts/ideas that haven’t been thought of yet (e.g. a fitted point process model, mixed-effects model, a sparse matrix)
    • abstract/hide implementation details from the user
  • classes = define new data types
  • methods = extend generic functions to specify the behavior of generic functions on new classes

  • setClass() = function to create new class
    • at minimum, name of class needs to be specified
    • slots or attributes can also be specified
      • a class is effectively a list, so slots are elements of that list
  • setMethod() = define methods for class
    • @ is used to access the slots/attributes of the class
  • showClass() = displays definition/information about class
  • when drafting a new class, new methods for print/show, summary, and plot should be written
  • Note: creating classes is not something to be done in the console; it is much better suited for a script
  • example
    • create polygon class with a set of (x, y) coordinates with setClass()
    • define a new plot function by extending existing plot function with setMethod()
# load methods library
library(methods)
# create polygon class with x and y coordinates as slots
setClass("polygon", representation(x = "numeric", y = "numeric"))
# create plot method for the polygon class ("polygon" is the signature here)
setMethod("plot", "polygon",
    # create function
    function(x, y, ...) {
        # plot the x and y coordinates
        plot(x@x, x@y, type = "n", ...)
        # plots lines between all (x, y) pairs
        # x@x[1] is added at the end because we need
        # to connect the last point of polygon to the first
        xp <- c(x@x, x@x[1])
        yp <- c(x@y, x@y[1])
        lines(xp, yp)
      })
## Creating a generic function for 'plot' from package 'graphics' in the global environment
## [1] "plot"
# print polygon method
showMethods("plot")
## Function: plot (package graphics)
## x="ANY"
## x="color"
## x="polygon"
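A usage sketch for the class above (coordinates invented for illustration): objects are created with new() and slots are accessed with @. The class definition is repeated so the snippet is self-contained.

```r
library(methods)
setClass("polygon", representation(x = "numeric", y = "numeric"))
p <- new("polygon", x = c(1, 2, 3, 4), y = c(1, 1, 2, 1))
p@x    # access the x slot: 1 2 3 4
# with the setMethod() definition above in effect, plot(p) would dispatch
# to the polygon method and draw the closed outline
```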

Yhat (link)

  • develop back-ends to algorithms to be hosted on-line for other people to access
  • others can create APIs (front-ends) to leverage the algorithm/model
  • before uploading, create an account and set up API key
## Create dataset of PM and O3 for all US taking year 2013 (annual
## data from EPA)

## This uses data from
## http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html

## Read in the 2013 Annual Data
d <- read.csv("annual_all_2013.csv", nrows = 68210)
# subset data to just variables we are interested in
sub <- subset(d, Parameter.Name %in% c("PM2.5 - Local Conditions", "Ozone")
              & Pollutant.Standard %in% c("Ozone 8-Hour 2008", "PM25 Annual 2006"),
              c(Longitude, Latitude, Parameter.Name, Arithmetic.Mean))
# calculate the average pollution for each location
pollavg <- aggregate(sub[, "Arithmetic.Mean"],
                     sub[, c("Longitude", "Latitude", "Parameter.Name")],
                     mean, na.rm = TRUE)
# refactors the Name parameter to drop all other levels
pollavg$Parameter.Name <- factor(pollavg$Parameter.Name, labels = c("ozone", "pm25"))
# renaming the last column from "x" (automatically generated) to "level"
names(pollavg)[4] <- "level"

# Remove unneeded objects
rm(d, sub)
# extract out just the location information for convenience
monitors <- data.matrix(pollavg[, c("Longitude", "Latitude")])
# load fields package which allows us to calculate distances on earth
library(fields)
# build function to calculate pollutant levels for a given set of coordinates
# input = data frame with columns lon (longitude), lat (latitude),
#         and radius (radius in miles for finding monitors)
pollutant <- function(df) {
        # extract longitude/latitude
        x <- data.matrix(df[, c("lon", "lat")])
        # extract radius
        r <- df$radius
        # calculate distances between all monitors and input coordinates
        d <- rdist.earth(monitors, x)
        # find the monitors whose distance is less than the input radius
        use <- lapply(seq_len(ncol(d)), function(i) {
                which(d[, i] < r[i])
        })
        # calculate average levels of ozone and pm2.5 over the selected locations
        levels <- sapply(use, function(idx) {
                with(pollavg[idx, ], tapply(level, Parameter.Name, mean))
        })
        # convert to data.frame and transpose
        dlevel <- as.data.frame(t(levels))
        # return the input data frame and the calculated levels
        data.frame(df, dlevel)
}

Deploying the Model

  • once the functions are complete, three more functions should be written in order to upload to yhat:
    • model.require(){} = defines dependencies on other packages
      • if there are no dependencies, this does not need to be defined
    • model.transform(){} = needed if the data needs to be transformed in anyway before feeding into the model
    • model.predict(){} = performs the prediction
  • store the following information as a vector named yhat.config
    • username = "<user@email.com>" = user name for yhat website
    • apikey = "<generatedKey>" = unique API key generated when you open an account with yhat
    • env="http://sandbox.yhathq.com/" = software environment (always going to be this link)
  • yhat.deploy("name") = uploads the model to yhat servers with provided credentials under the specified name
    • returns a data frame with status/model information
## Send to yhat
library(yhatr)

model.require <- function() {
        library(fields)
}

model.transform <- function(df) {
        df
}

model.predict <- function(df) {
        pollutant(df)
}

yhat.config  <- c(
        username="email@gmail.com",
        apikey="90d2a80bb532cabb2387aa51ac4553cc",
        env="http://sandbox.yhathq.com/"
)

yhat.deploy("pollutant")

Accessing the Model

  • once uploaded, the model can be accessed directly on the yhat website
    • click on name of the model after logging in or go to “http://cloud.yhathq.com/model/name” where name is the name you uploaded the model under
    • enter the inputs in JSON format (associative arrays): { "variable" : "value"}
      • example: { "lon" : -76.61, "lat": 39.28, "radius": 50 }
    • click on Send Data to Model
    • results return in the Model Response section
  • the model can also be accessed from R directly through the yhat.predict function
    • store the data you want to predict on in a data frame with the correct variable names
    • set up the configuration information through yhat.config (see above section)
    • yhat.predict("name", df) = returns the result by feeding the input data to the model hosted on yhat under your credentials
    • can be applied to multiple rows of data at the same time
library(yhatr)
yhat.config  <- c(
        username="email@gmail.com",
        apikey="90d2a80bb532cabb2387aa51ac4553cc",
        env="http://sandbox.yhathq.com/"
)
df <- data.frame(lon = c(-76.6167, -118.25), lat = c(39.2833, 34.05),
                 radius = 20)
yhat.predict("pollutant", df)
  • the model can also be accessed directly from command line interfaces (CLI) such as cmd on Windows and Terminal on Mac
curl -X POST -H "Content-Type: application/json" \
    --user email@gmail.com:90d2a80bb532cabb2387aa51ac4553cc \
    --data '{ "lon" : -76.61, "lat": 39.28, "radius": 50 }' \
    http://cloud.yhathq.com/rdpeng@gmail.com/models/pollutant/
  • additional example
# load library
library(yhatr)
# yhat functions
model.require <- function() {}
model.transform <- function(df) {
        transform(df, Wind = as.numeric(as.character(Wind)),
                  Temp = as.integer(as.character(Temp)))
}
model.predict <- function(df) {
        result <- data.frame(Ozone = predict(fit, newdata = df))
        cl <- data.frame(clWind = class(df$Wind), clTemp = class(df$Temp))
        data.frame(result, Temp = as.character(df$Temp),
                   Wind = as.character(df$Wind), cl)
}
# model
fit <- lm(Ozone ~ Wind + Temp, data = airquality)
# configuration
yhat.config  <- c(
        username="email@gmail.com",
        apikey="90d2a80bb532cabb2387aa51ac4553cc",
        env="http://sandbox.yhathq.com/"
)
# deploy to yhat
yhat.deploy("ozone")
# predict using uploaded model
yhat.predict("ozone", data.frame(Wind = 9.7, Temp = 67))

Data Scientists Toolbox Course Notes

Posted on 2018-08-07 | In R
Symbols count in article: 4.8k | Reading time ≈ 4 mins.


CLI (Command Line Interface)

  • / = root directory
  • ~ = home directory
  • pwd = print working directory (current directory)
  • clear = clear screen
  • ls = list stuff
    • -a = see all (hidden)
    • -l = details
  • cd = change directory
  • mkdir = make directory
  • touch = creates an empty file
  • cp = copy
    • cp <file> <directory> = copy a file to a directory
    • cp -r <directory> <newDirectory> = copy a directory and its contents to a new directory (-r = recursive)
  • rm = remove
    • -r = remove entire directories (no undo)
  • mv = move
    • mv <file> <directory> = move file to directory
    • mv <fileName> <newName> = rename file
  • echo = print arguments you give/variables
  • date = print current date
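The commands above can be strung together in a short session (directory and file names are examples):

```shell
mkdir project            # make a new directory
cd project
touch notes.txt          # create an empty file
cp notes.txt backup.txt  # copy it
mv backup.txt old.txt    # rename with mv
ls -al                   # list everything, including hidden entries
cd ..
rm -r project            # remove the whole directory (no undo)
```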

GitHub

  • Workflow
    1. make edits in workspace
    2. update index/add files
    3. commit to local repo
    4. push to remote repository
  • git add . = add all new files to be tracked
  • git add -u = updates tracking for files that are renamed or deleted
  • git add -A = both of the above
    • Note: add is performed before committing
  • git commit -m "message" = commit the changes you want to be saved to the local copy
  • git checkout -b branchname = create new branch
  • git branch = tells you what branch you are on
  • git checkout master = move back to the master branch
  • git pull = fetch and merge changes from the remote repository into your local branch (not to be confused with a pull request, which asks the repo owner to merge your changes)
  • git push = commit local changes to remote (GitHub)
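A sketch of the four-step workflow above (the commit message and remote/branch names are examples):

```shell
git add -A                        # stage new, modified, and deleted files
git commit -m "describe change"   # commit to the local repository
git push origin master            # push local commits to the remote (GitHub)
```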

JHU Coursera Rprogramming Course Project 3

Posted on 2018-08-06 | Edited on 2018-08-07 | In R
Symbols count in article: 4.6k | Reading time ≈ 4 mins.

JHU DataScience Specialization/Courses R Programming/Week 3/Course Project 3

Hospital Case Data Analysis

Assignment goal: analyze hospital data and write functions that rank hospitals by a specified outcome in each state

Data Files

  • [x] Data source: outcome-of-care-measures.csv

Contains information about 30-day mortality and readmission rates for heart attacks, heart failure, and pneumonia for over 4,000 hospitals.

  • [x] Documentation: Hospital_Revised_Flatfiles.pdf

Hospital with the Lowest 30-Day Mortality Rate

A function (best) that returns the result for a specified state (e.g. Texas, TX)

best <- function(state, outcome) {
        ## Read the outcome data
        dat <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
        ## Check that state and outcome are valid
        if (!state %in% unique(dat[, 7])) {
                stop("invalid state")
        }
        ## Map the outcome category to its column in the data
        col <- switch(outcome,
                      `heart attack` = 11,
                      `heart failure` = 17,
                      pneumonia = 23,
                      stop("invalid outcome"))
        ## Return hospital name in that state with lowest 30-day death rate
        df <- dat[dat$State == state, c(2, col)]
        df[which.min(df[, 2]), 1]
}

Example: the hospital with the lowest 30-day death rate in a specified state (Texas, TX)

## Warning in which.min(df[, 2]): NAs introduced by coercion
## [1] "Hospital with lowest 30-day death rate is: CYPRESS FAIRBANKS MEDICAL CENTER"

Hospital with the Lowest Heart-Attack Death Rate

A function (rankall) that returns the hospital with the lowest heart-attack death rate for the first 10 states in alphabetical order


JHU Coursera Rprogramming Assignment 2

Posted on 2018-08-06 | Edited on 2018-08-07 | In R
Symbols count in article: 2.8k | Reading time ≈ 3 mins.
JHU Coursera Rprogramming Assignment 2

JHU DataScience Specialization/Courses R Programming/Week 2/Programming Assignment 2

Programming Assignment 2: Functions and Caching

The two functions below are used to create a special object that stores a numeric matrix and caches its inverse

Step 1: write a function that stores four functions

makeCacheMatrix creates a list containing a function to 1. set the value of the matrix 2. get the value of the matrix 3. set the value of inverse of the matrix 4. get the value of inverse of the matrix

makeCacheMatrix <- function(x = matrix()) {
        inv <- NULL
        set <- function(y) {
                x <<- y
                inv <<- NULL
        }
        get <- function() x
        setinverse <- function(inverse) inv <<- inverse
        getinverse <- function() inv
        list(set = set, get = get, setinverse = setinverse, getinverse = getinverse)
}
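The companion function cacheSolve() is not shown in this excerpt; a hedged sketch of the usual pattern for this assignment:

```r
# Returns the inverse of the special "matrix" created by makeCacheMatrix(),
# using the cached value when one is already stored
cacheSolve <- function(x, ...) {
        inv <- x$getinverse()
        if (!is.null(inv)) {
                message("getting cached data")
                return(inv)
        }
        data <- x$get()
        inv <- solve(data, ...)   # compute the inverse
        x$setinverse(inv)         # cache it for next time
        inv
}
```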

JHU Coursera Datatools Course Project

Posted on 2018-08-06 | Edited on 2018-08-07 | In R
Symbols count in article: 6k | Reading time ≈ 5 mins.

JHU DataScience Specialization/Courses The Data Scientist’s Toolbox/Week??/Course Project

Datatools Course Project: Data Science Tools

Since this is a very introductory course (four weeks, one small quiz per week), this should be the final project (I forget). The main goal is to get comfortable with R's data-handling tools for formats such as xlsx and XML, and with useful packages such as readr. The downloads below are fairly large, so the code is not executed here; interested readers can try it themselves.

Reading CSV Data: the 2006 American Community Survey (ACS)

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

  • (pid, Population CSV file)
  • (hid, Household CSV file)

Documentation: PUMS Data Dictionary - 2006 Housing (PDF)

fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileUrl,destfile = "Fss06pid.csv",method = "libcurl")
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(fileUrl,destfile = "Fss06hid.csv",method = "libcurl")

Files of tens to hundreds of MB count as large; the cost of operations on them should be considered

## Unit: milliseconds
##                                           expr       min        lq      mean    median        uq       max neval
##  PID <- fread(file = "Fss06pid.csv", fill = T)  13.40045  13.40045  13.40045  13.40045  13.40045  13.40045     1
##                read_csv(file = "Fss06pid.csv") 627.03159 627.03159 627.03159 627.03159 627.03159 627.03159     1
##                read.csv(file = "Fss06pid.csv") 422.56456 422.56456 422.56456 422.56456 422.56456 422.56456     1
## Unit: milliseconds
##                                           expr       min        lq      mean    median        uq       max neval
##  HID <- fread(file = "Fss06hid.csv", fill = T)  17.08404  17.08404  17.08404  17.08404  17.08404  17.08404     1
##                read_csv(file = "Fss06hid.csv") 519.85311 519.85311 519.85311 519.85311 519.85311 519.85311     1
##                read.csv(file = "Fss06hid.csv") 971.56116 971.56116 971.56116 971.56116 971.56116 971.56116     1
## Unit: microseconds
##                                        expr      min        lq       mean    median        uq        max neval cld
##          tapply(PID$pwgtp15, PID$SEX, mean)  159.973  203.6820  262.39322  267.5640  289.3300    441.692   100  a
##   sapply(split(PID$pwgtp15, PID$SEX), mean)  137.322  162.9805  250.35640  215.8915  246.1520   3571.401   100  a
##            mean(PID[PID$SEX == 1, ]$pwgtp15) 1533.888 1615.2890 2570.51404 1694.3900 4332.1510   6311.799   100  b
##             mean(PID$pwgtp15, by = PID$SEX)   11.326   16.1045   20.48577   18.2280   22.4745     78.571   100  a
##           median(PID$pwgtp15, by = PID$SEX)   49.550   64.7685   91.37941   85.2955  109.8930    249.160   100  a
##          mean(PID[PID$SEX == 1, ]$SERIALNO) 1548.398 1624.1370 3693.22785 1740.4000 2707.6610 127876.038   100  b
##        median(PID[PID$SEX == 1, ]$SERIALNO) 1630.508 1709.9625 2321.75111 1751.5480 2145.2830   6184.035   100  b
pander(arrange(A,mean))
expr                                            min     lq      mean    median   uq      max     neval
PID <- fread(file = "Fss06pid.csv", fill = T)   13.4    13.4    13.4    13.4     13.4    13.4    1
read.csv(file = "Fss06pid.csv")                 422.6   422.6   422.6   422.6    422.6   422.6   1
read_csv(file = "Fss06pid.csv")                 627     627     627     627      627     627     1
pander(arrange(B,mean))

JHU Coursera Regression Model Quizzes

Posted on 2018-08-03 | Edited on 2018-08-06 | In R, DataScience
Symbols count in article: 8.9k | Reading time ≈ 8 mins.
JHU Coursera Regression Model Quizzes

JHU DataScience Specialization/Courses Regression Models/Week 1-4/Regression Model Quizzes

Mainly practice in computing basic regression-model quantities by hand

Week 2

Quiz 1

Compute a weighted mean by hand

x <- c(0.18, -1.54, 0.42, 0.95)
w <- c(2, 1, 3, 1)
mu.y <- sum(w * x) / sum(w)
sprintf("mean of y is : %f",mu.y)
## [1] "mean of y is : 0.147143"

Quiz 2

Linear regression

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
pander(lm(y~x)) # with intercept
Fitting linear model: y ~ x
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      1.567        1.252     1.252      0.246
x               -1.713        2.105   -0.8136     0.4394
pander(lm(y~x-1)) # remove the intercept (regression through the origin)
Fitting linear model: y ~ x - 1
      Estimate   Std. Error   t value   Pr(>|t|)
x       0.8263       0.5817     1.421     0.1892

Quiz 3

mtcars regression coefficients (mpg on wt)

(Intercept) wt
37.29 -5.344

Quiz 4

Practice solving for b1

\[\begin{align} Cor(Y,X) &= 0.5 \qquad Sd(Y) = 1 \qquad Sd(X) = 0.5 \\ \beta_1 &= Cor(Y,X) * \frac{Sd(Y)}{Sd(X)} \end{align}\]

B1 = 0.5 * 1 / 0.5

Quiz 5

corr <- .4; emean <- 0; varr1 <- 1
varr2 <- 1; b0 <- 0; x <- 1.5
b1 <- corr * sqrt(varr1) / sqrt(varr2)
(y <- b0 + b1 * x)
## [1] 0.6

Quiz 6

x <- c(8.58, 10.46, 9.01, 9.64, 8.86)
(x - mean(x)) / sd(x) # Choose No.1
## [1] -0.9718658  1.5310215 -0.3993969  0.4393366 -0.5990954

Quiz 7

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
pander(lm(y~x))
Fitting linear model: y ~ x
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)      1.567        1.252     1.252      0.246
x               -1.713        2.105   -0.8136     0.4394

Quiz 8

It must be identically 0.

Quiz 9

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
mean(x)
## [1] 0.573

Quiz 10

\[\begin{align} \beta_1 &= Cor(Y,X)*Sd(Y)/Sd(X) \\ Y_1 &= Cor(Y,X)*Sd(X)/Sd(Y) \\ \beta_1/Y_1 &= Sd(Y)^2/Sd(X)^2 \notag \\ &= Var(Y)/Var(X) \end{align}\]

© 2018 Autoz | Symbols count total: 115k | Reading time total ≈ 1:45